This technical report provides the supplementary material for a paper entitled "Information based
clustering," to appear shortly in Proceedings of the National Academy of Sciences (USA). In
Section I we present in detail the iterative clustering algorithm used in our experiments and
in Section II we describe the validation scheme used to determine the statistical significance of
our results. Then in subsequent sections we provide all the experimental results for three very
different applications: the response of gene expression in yeast to different forms of environmental
stress, the dynamics of stock prices in the Standard and Poor's 500, and viewer ratings of popular
movies. In particular, we highlight some of the results that seem to deserve special attention. All
the experimental results and relevant code, including a freely available web application, can be
found at http://www.genomics.princeton.edu/biophysics-theory
I. THE ICLUST ALGORITHM

Although clustering is a widely used method of data analysis and exploration, there is at present
no unique or universal mathematical formulation of the clustering problem. In practice, clustering
a given data set involves many choices at different levels of the analysis. In recent work we
suggest that some generality can be achieved through the use of information theory (1). Here we
review this formulation briefly and then proceed to the technical details of its implementation
that were left out of Ref. (1).

We formulate clustering as a tradeoff between maximizing the mean similarity of elements within a
cluster and minimizing the complexity of the description provided by cluster membership. Thus if
we have some similarity measure $s(i,j)$ between elements $i$ and $j$, optimal clustering is a
probabilistic assignment to clusters $C$ according to $P(C|i)$ such that we maximize

$$\mathcal{F} = \langle s \rangle - T\, I(C;i), \tag{1}$$

where $\langle s \rangle$ is the mean similarity of elements chosen at random out of each cluster,

$$\langle s \rangle = \sum_{C} P(C) \sum_{i,j} P(i|C)\, P(j|C)\, s(i,j), \tag{2}$$

and $I(C;i)$ is the information that clusters provide about the identity of their elements,

$$I(C;i) = \sum_{C} \sum_{i} P(i)\, P(C|i) \log \frac{P(C|i)}{P(C)}; \tag{3}$$

as usual we have

$$P(i|C) = \frac{P(C|i)\, P(i)}{P(C)}, \tag{4}$$

$$P(C) = \sum_{i} P(C|i)\, P(i), \tag{5}$$

and since in many cases all examples $i$ occur with equal probability [$P(i) = 1/N$] we consider
this case for simplicity, although it is not essential. This formulation can be generalized to
handle similarity measures defined on groups of more than two elements, but here we concentrate on
the conventional case where only pairwise interactions are considered. Importantly, as opposed to
using a problem specific similarity measure $s$, we use the generality of information theory once
more, and take $s(i,j)$ to be the pairwise mutual information between the observed patterns that
correspond to data items $i$ and $j$.
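To make the definitions in Eqs. (1)-(5) concrete, the following is a minimal Python sketch that evaluates the mean similarity, the information term, and the functional for a given soft assignment. It assumes the uniform prior P(i) = 1/N used in the text; the function names, the list-of-lists data layout, and the choice of base-2 logarithms (information in bits) are assumptions made for illustration only.

```python
import math

def cluster_stats(p_c_given_i, s):
    """Return (<s>, I(C;i), P(C)) for a soft assignment P(C|i).

    p_c_given_i[i][c] is P(C=c | i); s[i][j] is the pairwise similarity.
    A uniform prior P(i) = 1/N is assumed.
    """
    n = len(p_c_given_i)
    n_c = len(p_c_given_i[0])
    # Eq. (5): P(C) = sum_i P(C|i) P(i)
    p_c = [sum(row[c] for row in p_c_given_i) / n for c in range(n_c)]
    mean_s, info = 0.0, 0.0
    for c in range(n_c):
        if p_c[c] == 0.0:
            continue  # an empty cluster contributes nothing
        # Eq. (4): P(i|C) = P(C|i) P(i) / P(C)
        p_i_c = [p_c_given_i[i][c] / (n * p_c[c]) for i in range(n)]
        # Eq. (2): contribution of cluster c to the mean similarity
        s_c = sum(p_i_c[i] * p_i_c[j] * s[i][j]
                  for i in range(n) for j in range(n))
        mean_s += p_c[c] * s_c
        # Eq. (3): information in bits (base-2 log is an assumption)
        for i in range(n):
            p = p_c_given_i[i][c]
            if p > 0.0:
                info += (p / n) * math.log2(p / p_c[c])
    return mean_s, info, p_c

def functional(p_c_given_i, s, temperature):
    """Eq. (1): F = <s> - T * I(C;i)."""
    mean_s, info, _ = cluster_stats(p_c_given_i, s)
    return mean_s - temperature * info
```

For two elements that are each placed deterministically in their own cluster, with unit self-similarity and zero cross-similarity, this gives a mean similarity of 1 and an information of 1 bit; a uniform soft assignment carries no information.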

It is shown in Ref. (1) that any stationary point of our target functional, $\mathcal{F}$, must obey

$$P(C|i) = \frac{P(C)}{Z(i;T)} \exp\left\{ \frac{1}{T} \left[ 2 s(C;i) - s(C) \right] \right\}, \tag{6}$$

where $Z(i;T)$ is a normalization function, $s(C;i)$ is the expected similarity between $i$ and a
member of cluster $C$,

$$s(C;i) = \sum_{i_1=1}^{N} P(i_1|C)\, s(i_1,i), \tag{7}$$

and $s(C)$ is the average similarity among pairs chosen independently out of the cluster $C$,

$$s(C) = \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} P(i_1|C)\, P(i_2|C)\, s(i_1,i_2). \tag{8}$$

Eq. (6) defines an implicit set of equations since the right hand side depends on $P(i|C)$ and
$P(C)$. This is a common situation in variational methods, also present, for
example, in conventional rate-distortion clustering (Q), in 
maximum likelihood estimation with hidden variables (0) 
and in the Information Bottleneck framework (jsl). The 
standard strategy is to turn the self-consistency condition into an iterative algorithm.
Specifically, let us denote the intermediate solution of the algorithm at the $m$-th iteration by
$P^{(m)}(C|i)$. Then, at the $(m+1)$-th iteration, the algorithm applies the following update rule:

$$P^{(m+1)}(C|i) \leftarrow P^{(m)}(C) \exp\left\{ \frac{1}{T} \left[ 2 s^{(m)}(C;i) - s^{(m)}(C) \right] \right\}, \tag{9}$$

followed by a normalization step. Notice that the terms
$\{P^{(m)}(C), s^{(m)}(C;i), s^{(m)}(C)\}$ all are calculated using $P^{(m)}(C|i)$. Pseudo-code for
this algorithm is given in Figure 1. It is easy to verify that with a straightforward
implementation, the complexity of this algorithm is $O(N^2 \cdot N_c)$ for a single pass over the
entire data set, where $N_c$ is the number of clusters. We will refer to this algorithm as the
Iclust algorithm.
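The update rule of Eq. (9) can be sketched for a single element as follows. This is an illustrative Python sketch, not the pseudo-code of the figure referred to above: the function name `update_element`, the list-of-lists layout, the uniform prior P(i) = 1/N, and the log-domain normalization (added for numerical stability) are all assumptions.

```python
import math

def update_element(i, p_c_given_i, s, temperature):
    """Return the new row P(C|i) from Eq. (9) followed by normalization."""
    n = len(p_c_given_i)
    n_c = len(p_c_given_i[0])
    log_w = []
    for c in range(n_c):
        mass = sum(row[c] for row in p_c_given_i)        # equals N * P(C)
        if mass == 0.0:
            log_w.append(-math.inf)                      # empty cluster stays empty
            continue
        p_i_c = [row[c] / mass for row in p_c_given_i]   # P(i'|C), uniform prior
        # Eq. (7): expected similarity between i and a member of C
        s_ci = sum(p_i_c[k] * s[k][i] for k in range(n))
        # Eq. (8): average similarity among pairs drawn from C
        s_c = sum(p_i_c[k] * p_i_c[l] * s[k][l]
                  for k in range(n) for l in range(n))
        # log of the unnormalized right-hand side of Eq. (9)
        log_w.append(math.log(mass / n) + (2.0 * s_ci - s_c) / temperature)
    top = max(log_w)                                     # subtract the max before exp
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]
```

The double loop for Eq. (8) is written for clarity; recomputing it for every element costs more than the complexity quoted above, and a practical implementation would cache the per-cluster quantities.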

To gain some intuition let us consider a typical situation where $i$ is relatively similar to
elements in $C$, but very different from elements in $C'$. Thus, the exponent in Eq. (9) will be
positive for $i$ and $C$, but might be negative for $i$ and $C'$. Consequently, while applying the
update step the weight of assignment of $i$ to $C$ [$P(C|i)$] will be boosted while its assignment
to $C'$ will be decreased. This is clearly a desirable outcome, which in particular should increase
$\mathcal{F}$. Thus, since $\mathcal{F}$ is upper bounded (as a sum of information terms), after a
finite number of such updates the algorithm is expected to converge to a fixed point which
corresponds to a (possibly local) maximum of $\mathcal{F}$.

^ This report does not deal with the technical details of estimating mutual (and multi)
information from empirical data. The reader is referred to (2) for a complete description of the
estimation procedure used in our experiments.

This example also illustrates one of the differences be- 
tween our algorithm and previous approaches. While in 
the Blahut-Arimoto algorithm in rate-distortion theory 
13), in the iterative Information Bottleneck algorithm 
and typically also in EM for maximum likelihood Q) , the 
sign of the exponent is constant (for a given i) , this is not 
true in our case. In principle, such a non-constant exponent sign should imply faster convergence
to a local stationary point, but might also imply higher sensitivity to the random initialization
of $P(C|i)$. Thus, as in other work, we typically perform several runs with different random
initializations of $P(C|i)$ from which we choose the best solution, i.e., the one that maximizes
$\mathcal{F}$.

The Iclust algorithm presented here uses a sequential, or incremental, iterative procedure in which
the updates for some $i$ incorporate the implications of the updates for preceding elements,
$i' < i$. As a simple example, consider the case where we have three elements ($N = 3$) and two
clusters ($N_c = 2$). We start from some random conditional distribution matrix, $P^{(0)}(C|i)$,
which in particular defines $s^{(0)}(C)$, $\forall C = 1:2$. At the first iteration we find a new
distribution for the first element ($i = 1$) over the two clusters. Thus, we now have a new
conditional distribution matrix, $P^{(1)}(C|i)$, which differs from the previous $P^{(0)}(C|i)$
only by its first row. This distribution is used to define $s^{(1)}(C)$, $\forall C = 1:2$. Now, in
the next iteration, we find a new distribution for the second element ($i = 2$) over the two
clusters. This yields another new conditional distribution matrix, $P^{(2)}(C|i)$, which differs
from the previous $P^{(1)}(C|i)$ only by its middle row, and so on.
This process is somewhat in the spirit of the incremental 
EM ^ and the sequential Information Bottleneck algo- 
rithm (0). An alternative optimization routine, which 
seems less natural in our case, would be parallel opti- 
mization, used e.g., in standard EM (Q). In this case, if 
we continue our example, at the first iteration we will update all the rows in the conditional
distribution matrix, $P^{(0)}(C|i)$, using $s^{(0)}(C)$, to find the new $P^{(1)}(C|i)$.
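The difference between the sequential and the parallel schedules can be illustrated with the three-element, two-cluster example above. The sketch below is a hedged illustration under stated assumptions (uniform prior, list-of-lists layout, hypothetical function names); it is not the authors' implementation.

```python
import math

def new_row(i, p, s, t):
    """Eq. (9) plus normalization for element i, given the current matrix p."""
    n, n_c = len(p), len(p[0])
    log_w = []
    for c in range(n_c):
        mass = sum(row[c] for row in p)                  # N * P(C)
        if mass == 0.0:
            log_w.append(-math.inf)
            continue
        q = [row[c] / mass for row in p]                 # P(i'|C), uniform prior
        s_ci = sum(q[k] * s[k][i] for k in range(n))     # Eq. (7)
        s_c = sum(q[k] * q[l] * s[k][l]
                  for k in range(n) for l in range(n))   # Eq. (8)
        log_w.append(math.log(mass / n) + (2.0 * s_ci - s_c) / t)
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

def sequential_pass(p, s, t):
    """Update rows one at a time; each update sees the rows changed before it."""
    p = [row[:] for row in p]                            # leave the input untouched
    for i in range(len(p)):
        p[i] = new_row(i, p, s, t)
    return p

def parallel_pass(p, s, t):
    """Update every row from the same old matrix, as in standard EM."""
    return [new_row(i, p, s, t) for i in range(len(p))]
```

Both schedules produce the same first row, since it is computed from the initial matrix in either case; from the second row onward the sequential pass already uses the updated first row and therefore generally gives a different result.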

In some extreme cases the above algorithm might produce a non-monotonic behavior in $\mathcal{F}$.
That is, some of the updates might reduce $\mathcal{F}$, suggesting that obtaining a general proof
of convergence is a challenging goal. Nonetheless, even in these extreme cases, and more generally
in all our experiments (which included more than 1000 runs over real world problems with different
$T$ values and different numbers of clusters), the algorithm always converged to a stationary
point. Moreover, for the regime $T > \max_{i_1,i_2} s(i_1,i_2)$ it is possible to prove this
convergence analytically (the details will be presented elsewhere).

II. EVALUATING CLUSTERS' COHERENCE 

The central question in clustering is whether an essen- 
tially unsupervised analysis of a data set can recover cat- 
egories that have "meaning." In practice we assess this 
by comparison with some set of labels for the data that 
were generated by human intervention. To get started, 
then, we need a set of annotations (or labels) provided 
for every data item we clustered. Importantly, these an- 
notations are not used during the clustering process but 
rather are exposed only for the post-clustering valida- 
tion. Every data item might be assigned more than one 
annotation via different sources of information. The as- 
sumption is that these annotations reflect to some extent 
the "real" structure of the data that one wishes to reveal 
through the clustering process. 

To be more concrete, let us assume that we clustered $N$ elements, where each one of these elements
is assigned some set of annotations. Formally, this could be represented through an annotation
matrix, denoted as $A$, with $N$ rows and $R$ columns, where $R$ is the number of distinct
annotations in our data. Thus, $A(i,j) = 1$ if and only if the $i$-th element is assigned
annotation $a_j$, and zero otherwise. A simple example is given in Table I.
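As an illustration of this representation, the following Python sketch builds the binary annotation matrix from per-item annotation sets. The function name and the toy labels in the usage note are hypothetical and are not taken from the data.

```python
def annotation_matrix(item_annotations):
    """Build A with A[i][j] = 1 iff item i is assigned annotation j.

    item_annotations is a list of sets, one per clustered item.
    Returns the matrix and the ordered list of distinct annotations.
    """
    labels = sorted({a for annots in item_annotations for a in annots})
    column = {a: j for j, a in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in item_annotations]
    for i, annots in enumerate(item_annotations):
        for a in annots:
            matrix[i][column[a]] = 1
    return matrix, labels
```

For instance, three items annotated with {"heat shock"}, {"ribosome", "translation"} and the empty set yield a 3 x 3 matrix whose last row is all zeros.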

When we examine a single cluster, consisting of $n \le N$ elements, the first question we might ask
is whether some annotations occur in this cluster with a "suspiciously" high frequency. Let us
consider a specific annotation $a_j$ that is assigned to $K \le N$ elements in the entire
population and to $x \le n$ elements in the cluster. The probability of this event, under the null
hypothesis that elements are assigned to clusters at random, is given by the hypergeometric
distribution:

$$P_{hyper}(x|n,K,N) = \frac{\binom{K}{x} \binom{N-K}{n-x}}{\binom{N}{n}}. \tag{10}$$

The corresponding P-value is defined as the tail of this distribution:

$$P_{val}(x|n,K,N) = \sum_{x'=x}^{\min(n,K)} P_{hyper}(x'|n,K,N). \tag{11}$$

In words, it is the probability of observing $x$ or more elements in the cluster with annotation
$a_j$ where the members of the cluster are chosen independently of this annotation. Alternatively,
it is the probability of wrongly rejecting the hypothesis that the cluster has nothing to do with
the annotation $a_j$. The smaller the P-value the more unlikely this null hypothesis becomes. To
gain some intuition, several examples are presented in Table II.
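The hypergeometric probability and its tail can be computed directly from binomial coefficients. The sketch below uses only the Python standard library; the function names are chosen here for illustration.

```python
from math import comb

def hypergeom_pmf(x, n, K, N):
    """Eq. (10): probability that exactly x of the n cluster members carry an
    annotation assigned to K of the N elements overall."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def enrichment_pvalue(x, n, K, N):
    """Eq. (11): probability of observing x or more annotated members."""
    return sum(hypergeom_pmf(k, n, K, N) for k in range(x, min(n, K) + 1))
```

As a small check, when 5 of 10 elements carry an annotation and a cluster of 5 elements contains all of them, the P-value is 1/252, the probability of drawing exactly that subset.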

Having defined the statistical significance of a single 
event we need to bear in mind that in a single cluster 
one typically observes several (perhaps many) different 



annotations. Naturally, the more hypotheses one tests 
the less surprising it is to find one with a small P-value, 
even in a randomly chosen cluster. The simplest and 
most conservative approach to correct for this multiple 
hypotheses testing effect is to apply the Bonferroni correction (see, e.g., (Q)). Specifically, if
the statistical significance level is $q$ (e.g., $q = 0.05$), an event is considered significant if
and only if its P-value satisfies:

$$P_{val} \le \frac{q}{H}, \tag{12}$$

where $H$ is the number of hypotheses being tested. We will say that a cluster is enriched with the
annotation if the corresponding P-value satisfies Eq. (12).
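The Bonferroni criterion of Eq. (12) amounts to a single threshold comparison. In the sketch below the number of hypotheses H is taken to be the number of annotations tested for the cluster; this counting convention, like the function name, is an assumption made for illustration.

```python
def enriched_annotations(pvalues, q=0.05):
    """Return the annotations whose P-value satisfies Eq. (12).

    pvalues maps each tested annotation to its P-value for one cluster.
    H is assumed to be the number of annotations tested.
    """
    if not pvalues:
        return []
    threshold = q / len(pvalues)
    return sorted(a for a, p in pvalues.items() if p <= threshold)
```

With three tested annotations and q = 0.05 the threshold is about 0.0167, so a P-value of 0.02 is not counted as significant even though it is below q.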

Finally, while the above procedure determines the sig- 
nificance of every annotation that occurs in the cluster, it 
also is useful to have a single score that roughly summa- 
rizes how homogeneous the cluster is with respect to all 
annotations. Different alternatives have been proposed 
to this end and here we use the coherence score, sug- 
gested by Segal et al. Q), 

$$coh(C) = 100 \cdot \frac{n_{enriched}}{n}, \tag{13}$$

where $n$ is the number of items in the cluster $C$, and $n_{enriched}$ is the number of items in
$C$ with an annotation that was found to be significantly enriched in $C$. In other
words, the coherence of a cluster is simply the percentage 
of the cluster's elements covered by some annotation that 
was found to be enriched in that cluster. In particular, 
a coherence value above zero means that at least one 
annotation is enriched in the cluster, namely that there 
is at least a single hint regarding the reason for forming 
this cluster. 
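The coherence score is a simple coverage percentage, as the following sketch shows. The argument names and the data layout are hypothetical.

```python
def coherence(cluster_items, annotation_sets, enriched):
    """Eq. (13): percentage of cluster items covered by an enriched annotation.

    cluster_items   -- identifiers of the items in the cluster
    annotation_sets -- mapping from item identifier to its set of annotations
    enriched        -- set of annotations found to be enriched in this cluster
    """
    covered = sum(1 for i in cluster_items if annotation_sets[i] & enriched)
    return 100.0 * covered / len(cluster_items)
```

A cluster of four items in which two carry the single enriched annotation has a coherence of 50.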



III. FIRST APPLICATION: THE YEAST ESR DATA 
A. Description of the data 

We considered experiments on the response of gene expression levels in yeast to various forms of
environmental stress (10). Previous analysis of expression patterns from all ~6000 genes identified
a group of 283 stress-induced and 585 stress-repressed genes that had apparently "nearly identical
but opposite" expression profiles (10). This collection of 868 genes was thus termed the yeast
environmental stress response (ESR) module. As seen in Figure 2, differences in expression profiles
within the ESR module indeed are relatively subtle. More recent manual analysis with attention to
background biological data suggests that some of these differences are biologically significant
(11). Thus, it seems a good challenge for our approach to ask if we can discover automatically any
meaningful substructure in these data.

Each of the 868 ESR genes was represented by its log-ratio expression profile in the 173
microarray experiments (10), available at http://genome-www.stanford.edu/yeast_stress/data.shtml.
From these data we estimated all the ~376,000 pairwise mutual information relations $I_{ij}$, as
described in (2), ending up with an 868 x 868 matrix which defined the input to our clustering
procedure. For convenience, we provide here some statistics of the estimated mutual information
values. For a complete description, including different verification schemes that support the
reliability of our estimates, the reader is referred to (2).

Across all pairs of genes, the average estimated mutual information was 0.48 bits with a variance
of 0.0425 bits$^2$. This relatively high average value corresponds to the strong positive/negative
linear correlations known to be present in these data. Almost 7,000 pairs had a mutual information
greater than 1 bit, and the maximal estimated mutual information was 1.58 bits. All the pairwise
mutual information relations are presented in Figure 3, where the genes are sorted according to the
clustering partition into $N_c = 20$ clusters that we analyze in detail (see below). The diagonal
elements of this matrix, or self-information, were set to $I_{ii} = \log_2(5)$, the maximal
possible information under a quantization into five bins.


B. Quality-complexity trade-off curves

Given the pairwise mutual information matrix we applied the Iclust algorithm described in
Section I. Recall that our target functional, $\mathcal{F}$, is given by:

$$\mathcal{F} = \langle s \rangle - T\, I(C;i), \tag{14}$$

where $T$ is a (temperature) trade-off parameter,

$$\langle s \rangle = \sum_{C=1}^{N_c} P(C) \sum_{i,j} P(i|C)\, P(j|C)\, I_{ij} \tag{15}$$

measures the quality of the clusters, and $I(C;i)$ measures the cost of coding cluster identity.

For a fixed number of clusters, $N_c$, the term $\langle s \rangle$ gradually saturates as the
temperature $T$ is lowered, while $I(C;i)$ increases accordingly. We explored this trade-off for
different numbers of clusters: $N_c = 5, 10, 15, 20$. For each of these values we tried several
values of $T$; we found that $1/T \in \{5, 10, 15, 20, 25\}$ typically suffices to obtain a
relatively clear saturation of $\langle s \rangle$, hence we present the results for these $T$
values.

For each $\{N_c, T\}$ pair we performed 10 different random initializations, ending up with 10
(possibly) different local maxima of $\mathcal{F}$, from which we chose the best one. The resulting
trade-off curves are presented in the left panel of Figure 4. For a given $N_c$, as $T$ is lowered,
$\langle s \rangle$ increases but so does $I(C;i)$. In addition, the solutions become more
deterministic. For example, for $N_c = 20$ and $1/T = 15$, only ~44% of the genes have nearly
deterministic assignment [i.e., $P(C|i) > 0.9$ for a particular $C$]. For $1/T = 25$ this
percentage grows to ~85%.

The entire continuum of solutions, represented by the trade-off curves, may encompass a lot of
insights about the data. Nonetheless, for brevity, we focus our analysis on solutions for which the
saturation of $\langle s \rangle$ is relatively clear, i.e., on the four solutions with
$N_c = \{5, 10, 15, 20\}$ and $1/T = 25$. In all these partitions most of the genes (between 75%
and 85%) had a nearly deterministic assignment [$P(C|i) > 0.9$ for a particular $C$]. In the rest
of the analysis we treat these solutions as hard (i.e., deterministic) partitions where every gene
is assigned solely to its most probable cluster. In the next section we explore the possible
hierarchical relations between these four solutions. In later sections we analyze in detail the
specific solution with $\{N_c = 20, 1/T = 25\}$ that obtained the highest value of
$\langle s \rangle$.

^ This log transformation has no effect on our analysis since the mutual information is invariant
to such transformations H).

C. Comparing solutions at different numbers of clusters



A common dichotomy in the cluster analysis literature 
is between hierarchical and non-hierarchical, or parti- 
tional clustering algorithms [see, e.g., Ref IB)]. What is 
often missed, though, is the fact that applying a hierar- 
chical clustering algorithm typically enforces the output 
to be of a hierarchical nature, regardless of whether the 
data indeed call for this view. For example, applying 
an agglomerative clustering algorithm to the ESR data 
will produce, by definition, a nested tree-like hierarchy 
of partitions, although a priori it is not obvious whether 
the functional classification of these genes should be of a 
hierarchical nature. 

Because our approach is not constrained to hierarchi- 
cal structures, the emergence of even an approximate hi- 
erarchy would be a genuine result. To test for this, we 
start with several solutions that were found independently 
at different numbers of clusters and ask to what extent 
these solutions form a hierarchy. This is done in two 
steps. First, for every cluster we identify its best parent 
in the next (less detailed) level. Specifically, if C is some 
cluster at a partition with Nc clusters, then its best par- 
ent in a less detailed partition with N'c < Nc clusters will 
be the one that includes the maximal number of C mem- 
bers. Second, we measure how well this parent includes 
its child and represent the result through the type of the 
edge that we draw between the two clusters. 
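The first step of this scheme, finding the best parent of each cluster, can be sketched as below for two hard partitions of the same items. The inclusion fraction returned alongside the parent is one possible way to quantify how well the parent includes its child; that particular measure, and the function name, are assumptions made for illustration.

```python
from collections import Counter

def best_parents(child_labels, parent_labels):
    """For each child cluster, return (best parent, fraction of its members there).

    child_labels[k] and parent_labels[k] are the cluster indices of item k in the
    more detailed and the less detailed partition, respectively.
    """
    overlap = {}
    for child, parent in zip(child_labels, parent_labels):
        overlap.setdefault(child, Counter())[parent] += 1
    result = {}
    for child, counts in overlap.items():
        parent, shared = counts.most_common(1)[0]   # parent holding most members
        result[child] = (parent, shared / sum(counts.values()))
    return result
```

This sketch reports a single parent per child; when several candidate parents hold equally many members, a fuller implementation would report all of them.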

The hierarchical graph produced by this scheme is dif- 
ferent from the standard output of hierarchical clustering 
algorithms in several aspects. To start, a cluster might 
have more than one parent if its members are equally dis- 
tributed among several clusters in the less detailed parti- 
tion. Next, a cluster might have no children if it is not the 
best parent of any cluster at the more detailed partition. 
Last, the characteristics of the edges convey further information regarding how well the independent solutions
form a hierarchy. In particular, a graph with many high 
quality inclusion edges is a good indication that the data 
are hierarchical in nature. In contrast, a graph in which 
many of the inclusions from one level to the other are 
only partial suggests that a hierarchical view of the data 
is somewhat misleading. 

We applied this scheme to the four solutions we obtained independently for
$N_c = \{5, 10, 15, 20\}$ with $1/T = 25$. The results are presented in Figure 5. As can be seen
in the figure, the independent solutions form an approximately hierarchical structure.
Interestingly, some functional modules are better preserved than others across the different
levels. For example, the ribosome cluster, c18, clearly is identified in all the independent
solutions.



D. Coherence results 

1. Constructing the annotation matrices 

As already mentioned, clusters' coherence is estimated 
with respect to a given annotation matrix. For yeast 
genes, different sources of information may provide these 
data. One important resource is the Gene Ontology (GO) 
database {3) , which is the one that we use in this work; 
specifically, we used the December 2003 version. 

The GO database consists of three structured ontologies (controlled vocabularies) that describe
gene products in terms of their associated Biological Processes ($GO_{BP}$), Molecular Functions
($GO_{MF}$), and Cellular Components ($GO_{CC}$). For each of these three ontologies we constructed
a corresponding annotation matrix. Thus, for example, if $A_{BP}$ is the matrix constructed for the
$GO_{BP}$ ontology then $A_{BP}(i,j) = 1$ if and only if the $i$-th gene in our data is assigned to
the $j$-th biological process. A small subset of this annotation matrix is presented in Table III.

Each of the GO ontologies is organized in a hierarchical manner where more specific annotations
correspond to nodes which are more distant from the ontology root. This might yield evaluation
difficulties if one considers only the particular GO terms with which a gene is annotated (13). An
example is given in Table IV. Here, several genes that were all assigned to the same cluster are
annotated with different specific $GO_{BP}$ terms, and their functional relationship becomes
evident only if one notices that all these annotations have a common (more general) ancestor in the
ontology. We applied a standard routine to overcome this difficulty, in which every gene was
assigned not only its direct GO annotations but also all the ancestors of these annotations in the
GO hierarchy. This is consistent with the GO organization, in which if a GO term describes some
gene product then all its parent terms in the ontology also apply to that gene product.
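The ancestor propagation described above is a transitive closure over the ontology graph. The sketch below assumes the ontology is available as a mapping from each term to its direct parents; the function name and the toy term names in the usage note are hypothetical and are not actual GO identifiers.

```python
def with_ancestors(direct_terms, parents):
    """Return the direct terms together with all of their ancestors.

    parents maps a term to an iterable of its direct parent terms; terms
    without an entry are treated as roots.
    """
    closed = set()
    stack = list(direct_terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed
```

With a toy ontology in which "rRNA processing" has parent "RNA processing", and both "RNA processing" and "tRNA aminoacylation" have parent "RNA metabolism", two genes annotated with "rRNA processing" and "tRNA aminoacylation" share no direct term but do share "RNA metabolism" after propagation.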

Last, while estimating clusters' coherence we removed annotations that were assigned to fewer than
two genes in our data, since these annotations obviously cannot be enriched in any cluster. We also
removed from the analysis genes that were not assigned any annotations, or were assigned the
unknown annotation. The details of the resulting annotation matrices are given in Table V.



2. Coherence results and comparisons 

We estimated the statistical coherence of the clusters obtained at the low-temperature end of the
trade-off curves where $1/T = 25$. This coherence was estimated with respect to each of the three
Gene Ontologies. To gain some perspective, we applied similar analysis with a recent release of the
Cluster software, called Cluster 3.0 (15). This software is considered to be a state-of-the-art
(and quite popular) tool for cluster analysis of gene expression data. We experimented extensively
with all the basic algorithms available in this package. These include two different variants of
iterative K-means clustering (K-means and K-medians) and four different variants of hierarchical
clustering (complete linkage, average linkage, centroid linkage, and single linkage). With
each of these algorithms we tried three standard similar- 
ity measures: the Pearson correlation ("centered corre- 
lation") ©, the absolute value of the Pearson correla- 
tion, and the Euclidean distance. Thus, altogether we 
compared our performance to 18 different configurations 
of this software which are probably the most commonly 
used configurations. For the six K-means variants we 
tried 100 different random initializations in every run, 
from which the best solution (the one with the smallest 
sum of within-cluster distances) was chosen. The com- 
parison was undertaken at all the different numbers of 
clusters, $N_c = 5, 10, 15, 20$. The results are summarized in Table VI to Table IX. The average
results are given in Figure 6.

In all cases the Iclust algorithm was clearly superior to 
all of the 12 hierarchical algorithms we tried. It should be 
stressed that these algorithms are considered a powerful 
tool for analyzing genomic datasets, and many published 
applications are based on this type of hierarchical anal- 
ysis. Nonetheless, standard hierarchical clustering typi- 
cally failed to see a significant substructure in the ESR 
module. In most cases Iclust was also superior to the 
average performance of the six K-means variants, and 
in some cases (e.g., Nc = 5) it was in fact superior to 
all the K-means variants. Averaging over all three Gene Ontologies and over all four $N_c$ values,
Iclust obtains a coherence of ~56% while the average K-means coherence is ~42% and the average
hierarchical coherence is ~12%.

We further repeated this comparison with all the com- 
peting algorithms while considering the log2 ratios of 
expression profiles as input, instead of the raw ratios. 
Even with this preprocessing (to which our approach is 
invariant), the Iclust average performance is superior to 
almost all the 18 alternatives, typically by a significant margin. Specifically, when averaging
over all three Gene Ontologies and over all four $N_c$ values, the average K-means coherence is
~52% and the average hierarchical coherence is ~19%. While there exists some intuitive
motivation for the log2 preprocessing, there is no formal 
justification. Clearly, from a principled point of view, an 
approach which is invariant to such transformations is 
preferable, even if it were to generate only comparably 
good results. 



E. Detailed results for Nc = 20 clusters 

In Table X we present all enriched annotations for the Iclust partition with $N_c = 20$ clusters
and $1/T = 25$.
Further examination of these clusters yields several ob- 
servations that allow us to see in more detail what makes 
these clusters meaningful solutions to our problem. 

First, in several cases the extracted clusters consist 
of genes from both the nominal induced and repressed 
groups. For example, c5 consists of 26 induced genes 
(enriched with oxidoreductase activity) and 6 repressed 
ones. In Figure 7A we see that the genes in this cluster have a relatively augmented response under
menadione exposure and a relatively reduced response in stationary phase, as opposed to genes not
in this cluster.

In Figure 7B we display the average behavior of the 22 induced genes in c8 versus the 49 induced
genes in c19 in two opposing temperature shifts. Although all are induced by heat, the genes in c19
are more sensitive to this treatment, which is consistent with the enrichment of heat shock protein
activity in this cluster.

Cluster c18 consists of 122 repressed genes which were mainly ribosomal proteins. In Figure 7C we
see that the genes in this cluster exhibit a distinguished transient expression pattern under,
e.g., diamide treatment, a fact that was already mentioned in (11). On the other hand, cluster c16
consists of 87 repressed genes and is enriched for ribosome biogenesis and other related
annotations. In the same figure we see that this cluster exhibits another distinctive behavior with
respect to the rest of the repressed genes.

In Figure 7D we consider again two clusters, c2 and c7, which seem to involve ribosomal proteins
and ribosome biogenesis, respectively. As seen in the figure, when the cells converge to a
quiescent state under nitrogen depletion, these two clusters exhibit quite different behaviors.

In Figure 7E we see another intriguing behavior of two clusters, c15 and c17, under steady-state
growth at different temperatures. From the GO annotations we find that c15, which includes 12
repressed genes, is enriched for tRNA aminoacylation, while c17, which includes 7 repressed genes,
is enriched with cell cycle related annotations. Figure 7F demonstrates that the distinction
between these two clusters is not spurious, as they display different behaviors, e.g., in response
to hyper-osmotic shock.

As two complementary validation schemes we used the regulator-promoter region interactions reported
in (17)^ and the DNA-binding sequence motifs provided in (18).^ In most of our clusters we found
enrichment of regulatory interactions and/or known sequence motifs in the corresponding upstream
sequences ($P_{val} < 0.05$, Bonferroni corrected). For example, c5, c19, and c17 were enriched for
YAP1, HSF1, and MBP1, respectively. As YAP1 is known to be involved with response to oxidative
stress, HSF1 with response to heat, and MBP1 with cell cycle regulation, these enrichments are
clearly consistent with the GO enrichments for the same clusters. c18 and c2 (Figs. 4C,D) were
enriched with FHL1, which is required for rRNA processing, and c18 was further enriched with RAP1,
known to be involved in regulating ribosomal proteins, and with four other regulators (GAT3, YAP5,
PDR1, and RGM1), suggesting similar, yet not identical, regulatory programs for these two
functionally related clusters. c16 was enriched for ABF1 and both c7 and c16 were enriched with
several motifs which are known to be related to rRNA processing and synthesis, consistent with the
GO annotations enriched for these clusters.



F. A cluster enriched with uncharacterized genes 

In the statistical validation of our clusters (Section III D) we removed from the analysis
uncharacterized genes. Nonetheless, the distribution of the uncharacterized genes among our
clusters yields an intriguing result.
One might have suspected that almost every process in 
the cell has a few components that have not been identi- 
fied, and hence that as these processes are regulated there 
would be a handful of unknown genes that are regulated 
in concert with many genes of known function. For at 
least one of our clusters, our results reveal a different 
picture. 

Given the fraction of uncharacterized genes in a cluster and the corresponding fraction in the
entire population, one can use the hypergeometric distribution to calculate a P-value for this
event (see Section II). Applying this to our partition into $N_c = 20$ clusters we find that c7 is
significantly enriched with genes that are uncharacterized in the $GO_{BP}$ and $GO_{MF}$
ontologies.

Specifically, out of the 123 genes in c7, 72 have an 



^ In these data, every gene is "annotated" with 106 "P-value" 
scores that determine the probability of this gene being regulated 
by each of 106 yeast transcriptional regulators. By considering 
only interactions with a P-value lower than 0.005 we constructed 
out of these data an annotation matrix with 868 (gene) rows, 106 
(regulator) columns and a total 1307 predicted interactions. 

* Here, again, one can construct an annotation matrix where 
A(i,j) = 1 if and only if the 1,000 base-pair promoter sequence 
of the i-th gene includes the j-th motif. After considering only 
the 100 most frequent motifs we ended up with an annotation 
matrix, with 868 (gene) rows, 100 (motif) columns and 19, 517 
predicted interactions. 
unknown molecular function. This level of concentration has a (P-value) probability of ~ 10~* to have arisen
by chance. Moreover, if we consider only the repressed 
genes in the ESR module (since c7 consists mainly of such 
genes), we see that 69 out of the 114 repressed genes in 
c7 are uncharacterized in the GOmf ontology, which has 
a P-value of - 10-^^ 

Closer examination of the $GO_{BP}$ characterized genes in the same cluster reveals several
enriched annotations (see Table X) related to ribosome biogenesis and ribosomal RNA processing,
suggesting that most of the previously unannotated genes in this cluster are involved
in these processes as well. Nonetheless, the extremely 
high concentration of uncharacterized genes in this clus- 
ter suggests that these genes are involved with biologi- 
cal processes which are harder to detect and characterize 
with the current technologies. 

Finally, it is also worthwhile to point out that the cluster
c7 is extremely well conserved when one looks for partitions
with a smaller number of clusters, as demonstrated
in Figure 5. In fact, all the parent clusters of this c7 cluster
(for Nc = 5, 10, 15) were similarly enriched for GObp
and GOmf uncharacterized genes.

IV. SECOND APPLICATION: THE SP500 DATA 
A. Description of the data 

In our second application we consider a very different
data set, the companies in the Standard and Poor's
500 list. We used the May 2004 listing of the 500 companies,
available at http://www.standardandpoors.com. For
these companies we examine the day-to-day fractional
changes in stock price during the trading days between
December 2, 2002, and December 31, 2003 (a total of
273 trading days), as seen in Figure 8.^

From these data we estimated all the ~125,000 mutual
information relations, as described in (2), ending up
with a 500 x 500 matrix I_ij which, as before, defines the
input to our clustering procedure. For convenience, we
provide here some statistics of the estimated mutual
information values. For a complete description, including
different verification schemes that support the reliability 
of our estimates, the reader is referred to Q). 

Across all pairs of companies, the average estimated
mutual information was 0.10 bits with a variance of
0.0054 bits^2, and the maximal estimated mutual information
was 0.97 bits. All the pairwise mutual information



^ These data are available at http://wrds.wharton.upenn.edu. We
identified the different companies by their ticker symbols as
reported at http://www.standardandpoors.com. However, these
symbols are not unique, and as a result the database at
http://wrds.wharton.upenn.edu returned slightly more than 500
companies; for 501 of these the data were available for the
entire 2003 year, hence these are the companies we consider in our
analysis.



relations are presented in Figure 9, where the companies
are sorted according to the clustering partition into
Nc = 20 clusters that we analyze in detail (see below).
The self-information relations were set to I_ii = log2(5),
which corresponds to the maximal possible information
under a quantization into five bins (2).
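
To make the role of the five-bin quantization concrete, the following sketch computes a naive plug-in estimate of the mutual information between two series. It assumes equal-population (rank-based) bins, and the function names `quantize` and `mutual_information` are hypothetical. The estimation procedure actually used for the results reported here is the one described in (2), which this simplified version does not reproduce.

```python
import math
from collections import Counter


def quantize(values, bins=5):
    """Assign each value to one of `bins` equal-population bins by rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    labels = [0] * len(values)
    for rank, k in enumerate(order):
        labels[k] = rank * bins // len(values)
    return labels


def mutual_information(x, y, bins=5):
    """Plug-in mutual information (in bits) of two quantized series."""
    qx, qy = quantize(x, bins), quantize(y, bins)
    n = len(qx)
    joint = Counter(zip(qx, qy))
    px, py = Counter(qx), Counter(qy)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


# A series compared with itself reaches the upper bound log2(5) bits.
series = [float(k) for k in range(100)]
self_information = mutual_information(series, series)
```

With five equally populated bins no pair can carry more than log2(5) bits, which is why that value is used for the diagonal of the matrix.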



B. Quality-complexity trade-off curves 

Given the pairwise mutual information matrix we applied
the Iclust algorithm described earlier. As
in the first application, we explored the trade-off between
⟨s⟩ and I(C;i) for different numbers of clusters,
Nc = 5, 10, 15, 20, and for different values of the trade-off
parameter, T. Specifically, we found that 1/T =
{15, 20, 25, 30, 35} were typically sufficient to obtain a
relatively clear saturation of ⟨s⟩, hence we present the
results for these T values. For each {Nc, T} pair we performed
10 different random initializations, ending up with
10 (possibly) different local maxima of F, from which we
chose the best one. The resulting trade-off curves are
presented in the middle panel of Figure 4.

As before, as T is lowered, ⟨s⟩ increases but so does
I(C;i). In addition, the solutions become more deterministic.
For example, for Nc = 20 and 1/T = 25, only
~36% of the companies have a nearly deterministic assignment
[P(C|i) > 0.9 for a particular C]. On the other
hand, for 1/T = 35, all the assignments are nearly deterministic
[P(C|i) > 0.9].

For brevity, we focus our analysis on solutions for
which the saturation of ⟨s⟩ is relatively clear, i.e., on the
four solutions with Nc = {5, 10, 15, 20} and 1/T = 35.
In all these partitions almost all of the companies had a
nearly deterministic assignment [P(C|i) > 0.9 for a particular
C], so we treat these solutions as hard partitions
where every company is assigned solely to its most probable
cluster.
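
The conversion from a soft solution to a hard partition described above can be sketched as follows. The function name `harden` and the toy assignment matrix are hypothetical; the 0.9 threshold is the one quoted in the text.

```python
def harden(p_c_given_i, threshold=0.9):
    """Return hard labels and the fraction of nearly deterministic rows.

    p_c_given_i[i][c] holds P(C = c | i); every row sums to one.
    """
    hard = [max(range(len(row)), key=row.__getitem__) for row in p_c_given_i]
    n_deterministic = sum(1 for row in p_c_given_i if max(row) > threshold)
    return hard, n_deterministic / len(p_c_given_i)


# Hypothetical soft assignments of four elements to three clusters.
soft = [
    [0.95, 0.03, 0.02],
    [0.10, 0.85, 0.05],
    [0.01, 0.01, 0.98],
    [0.40, 0.35, 0.25],
]
labels, fraction = harden(soft)
```

Here `labels` holds the most probable cluster of every element and `fraction` is the share of elements whose largest membership probability exceeds the threshold.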



C. Comparing solutions at different numbers of clusters 

We examine directly how well our independent solutions
form a hierarchical structure. Accordingly, we
apply exactly the same scheme as described in Section III.C
to the four solutions we obtained independently
for Nc = {5, 10, 15, 20} with 1/T = 35. The results are
presented in Figure 10. Again, the independent solutions
form only an approximate hierarchy. Nonetheless,
this approximation seems more suitable in this case, as
demonstrated, e.g., by the larger percentage of nearly perfect
inclusion relations (solid bold edges in the figure).
It should be noted that indeed the standard classification
of these companies is hierarchical in nature (see Section IV.D.1).

Again, it is worthwhile to point out that some of the
clusters are better preserved than others across the different
levels. For example, the Semiconductor Equipment
cluster, c11, is clearly identified in all the independent solutions.



D. Coherence results 

1. Constructing the annotation matrices 

We used the Global Industry Classification Standard
(GICS), which classifies companies at four different levels:
sector, industry group, industry, and sub-industry
(see http://www.standardandpoors.com). These four levels
are arranged in a well-defined, tree-like hierarchy.
The top (sector) level consists of 10 different annotations:
Consumer Discretionary, Consumer Staples, Energy,
Financials, Health Care, Industrials, Information
Technology, Materials, Telecommunication Services, and
Utilities. The next (industry group) level consists of 24
distinct annotations. The next (industry) level consists
of 59 distinct annotations. The last (sub-industry) level
consists of 114 distinct annotations. Thus, altogether
there are 207 different annotations, where every company
is assigned exactly four annotations, one at every level of
the hierarchy.

As in the first application, while estimating the clusters'
coherence we removed annotations that were assigned to
fewer than two companies in our data, ending up with a
total of 178 distinct annotations.
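
As an illustration of the filtering step above, the following sketch builds a binary annotation matrix from per-element labels and drops every annotation assigned to fewer than two elements. The function name `build_annotation_matrix` and the three toy companies are hypothetical and serve only as an example.

```python
from collections import Counter


def build_annotation_matrix(labels, min_count=2):
    """labels: dict mapping element name -> iterable of annotation strings.

    Returns (elements, kept_annotations, matrix), where matrix[i][j] = 1
    if and only if element i carries kept annotation j.
    """
    counts = Counter(a for annots in labels.values() for a in set(annots))
    kept = sorted(a for a, c in counts.items() if c >= min_count)
    column = {a: j for j, a in enumerate(kept)}
    elements = sorted(labels)
    matrix = [[0] * len(kept) for _ in elements]
    for i, name in enumerate(elements):
        for a in labels[name]:
            if a in column:
                matrix[i][column[a]] = 1
    return elements, kept, matrix


# Hypothetical toy input: made-up companies with two annotation levels each.
toy = {
    "company_a": ["Energy", "Oil & Gas Drilling"],
    "company_b": ["Energy", "Integrated Oil & Gas"],
    "company_c": ["Financials", "Regional Banks"],
}
elements, kept, matrix = build_annotation_matrix(toy)
```

In this toy input only "Energy" is shared by at least two companies, so it is the only column that survives the filter.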



2. Coherence results and comparisons

We estimated the coherence of the clusters obtained at
the low-temperature end of the trade-off curves where
1/T = 35. To gain some perspective, we applied a similar
analysis to the results obtained with the Cluster 3.0
software ifTsI) . We experimented with the same 18 basic 
configurations as in the previous application (K-means
variants, again with 100 different initializations), and applied
the comparison to all the different numbers of clusters
we examined, Nc = 5, 10, 15, 20. The results are
summarized in Tables XI to XIV and in Figure 11.

In all cases, Iclust was superior to the average performance
of the K-means and the hierarchical Cluster 3.0
variants. In fact, except for the K-medians configurations,
none of the other algorithms came even close to
the Iclust performance. Averaging over all four Nc values,
Iclust obtains an average coherence of ~90% while
the average K-means coherence is ~79% and the average
Hierarchical coherence is only ~19%.
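
For orientation, the sketch below shows one way a coherence score of this kind could be computed. It assumes that the coherence of a cluster is the percentage of its elements carrying at least one annotation that is significantly enriched in the cluster, with enrichment tested by the hypergeometric tail and a Bonferroni correction over the number of annotations at the 0.05 level. These definitions, the function names and the toy data are assumptions made for illustration; the authors' exact procedure is given in their validation section.

```python
from math import comb


def tail(N, K, n, x):
    """Upper-tail hypergeometric probability P(X >= x)."""
    return sum(comb(K, k) * comb(N - K, n - k)
               for k in range(x, min(K, n) + 1)) / comb(N, n)


def coherence(cluster, annotations, alpha=0.05):
    """cluster: list of element ids; annotations: dict id -> set of labels."""
    N = len(annotations)
    all_labels = set().union(*annotations.values())
    members = set(cluster)
    enriched = set()
    for label in all_labels:
        K = sum(1 for labs in annotations.values() if label in labs)
        x = sum(1 for e in members if label in annotations[e])
        # Bonferroni correction: multiply by the number of tested labels.
        if x > 0 and tail(N, K, len(members), x) * len(all_labels) < alpha:
            enriched.add(label)
    covered = sum(1 for e in members if annotations[e] & enriched)
    return 100.0 * covered / len(members), enriched


# Hypothetical population of 20 elements: six "bank", fourteen "other".
population = {e: ({"bank"} if e < 6 else {"other"}) for e in range(20)}
score, enriched_labels = coherence([0, 1, 2, 3, 4, 6], population)
```

In this toy example five of the six cluster members carry the enriched label, so the cluster is not perfectly coherent even though its dominant annotation is significant.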

It is interesting to point out that although the annotations
for these data are arranged in a relatively simple
and clear hierarchical structure, the performance of the
hierarchical algorithms is still relatively poor, perhaps
due to the greedy nature of these optimization routines,
which typically yield suboptimal solutions.



E. Detailed results for Nc = 20 clusters

In Table XV we present all enriched annotations for
the Iclust partition with Nc = 20 clusters and 1/T = 35.
Several specific results are noted in the following.

First, 8 out of the 20 clusters are perfectly (100%) coherent.
For example, c11 consists of 18 companies which
are all Information Technology companies, mainly sub-classified
as Semiconductors & Semiconductor Equipment
companies such as Intel and Texas Instruments. In
contrast, c17 consists mainly of different types of retail
stores: Department Stores like Sears, General Merchandise
Stores like Target, Specialty Stores like Staples, and
so on.

Perhaps more interesting is the relatively subtle distinction
between c7 and c1, both of which are perfectly
coherent. The former includes mainly companies which
are classified as Investment Banking & Brokerage (e.g.,
Merrill Lynch) or Asset Management & Custody Banks,
while the latter corresponds to Commercial (Regional)
Banks like PNC. Indeed, in Figure 10 we see that these
two clusters nicely merge with each other at the independent
solution found for Nc = 15 clusters. A similar
relatively subtle distinction is also captured between c6
and c20 (again, both are perfectly coherent), where both
clusters correspond to different sub-classifications of the
Oil & Gas category. As for the previous pair, these two
clusters also merge for Nc = 15.

Even in clusters with non-perfect coherence we typically
see a clear reasoning behind the automatically recovered
structure. For example, c16 is enriched only for
three Hotels, Resorts & Cruise Lines companies, with
a coherence level of only 30%. Nonetheless, it further
contains two banks (MBNA and Capital One Financial)
which specialise in credit card issuing and therefore consumer
spending, a company (CINTAS) which is a builder
of corporate identity, and another company (Paychex)
which handles payroll and human resource services for
employees. In addition, the Walt Disney Co. is also in
this cluster, presumably due to its parks and resorts division.



V. THIRD APPLICATION: THE EACHMOVIE DATA 
A. Description of the data 

In our third test case we consider the EachMovie
dataset, movie ratings provided by more than 70,000
viewers.^ These data are inherently quantized, as only
six discrete possible ratings were used. Indeed, many real-life
clustering problems involve such categorical data. In
these cases the issue of what similarity measure to use
seems even more obscure, especially if the descriptive attributes
are not naturally ordered, and our general information-theoretic
approach seems especially promising.

^ See http://www.research.digital.com,

We represented each movie by its ratings from different
viewers and focused on the 500 movies that got the
maximal number of votes. These data are presented in
Figure 12. We estimated all the ~125,000 mutual information
relations as in the previous applications; again
see Ref. (2) for details. Notice that in estimating the mutual
information for a pair of movies, only viewers who
voted for both movies can be considered.
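
The restriction to viewers who rated both movies can be sketched as follows. This is a plain plug-in estimate for categorical ratings, assuming missing votes are encoded as 0; the function name `co_rated_mutual_information` and the toy rating vectors are hypothetical, and the estimator actually used is the one described in Ref. (2).

```python
import math
from collections import Counter


def co_rated_mutual_information(r1, r2, missing=0):
    """Plug-in mutual information (bits) over viewers who rated both movies.

    r1[v] and r2[v] are the ratings given by viewer v; `missing` marks
    the absence of a vote.
    """
    pairs = [(a, b) for a, b in zip(r1, r2) if a != missing and b != missing]
    n = len(pairs)
    if n == 0:
        return 0.0  # no common viewers, nothing can be estimated
    joint = Counter(pairs)
    m1 = Counter(a for a, _ in pairs)
    m2 = Counter(b for _, b in pairs)
    return sum(c / n * math.log2(c * n / (m1[a] * m2[b]))
               for (a, b), c in joint.items())


# Hypothetical ratings on a six-level scale; the last two viewers are
# dropped because each of them rated only one of the two movies.
movie_a = [1, 2, 3, 4, 5, 6, 0, 3]
movie_b = [1, 2, 3, 4, 5, 6, 2, 0]
info = co_rated_mutual_information(movie_a, movie_b)
```

In this toy case the six remaining viewers give identical, uniformly spread ratings, so the estimate equals the upper bound of log2(6) bits for six rating levels.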

Across all pairs of movies, the average estimated mutual
information was 0.052 bits, with a variance of
0.0026 bits^2, and the maximal estimated mutual information
was 0.89 bits. All the pairwise mutual information
relations are presented in Figure 13, where the movies are
sorted according to the clustering partition into Nc = 20
clusters that we analyze in detail (see below). The self-information
relations were set to I_ii = log2(6), which corresponds
to the maximal possible information under a
quantization into six bins.



B. Quality-complexity trade-off curves 

Given the pairwise mutual information matrix we applied
the Iclust algorithm described earlier. As
in the previous applications, we explored the trade-off
between ⟨s⟩ and I(C;i) for different numbers of clusters,
Nc = 5, 10, 15, 20, and for different values of the
trade-off parameter, T. Specifically, we found that
1/T = {20, 25, 30, 35, 40} were typically sufficient to obtain
a relatively clear saturation of ⟨s⟩, hence we present
the results for these T values. For each {Nc, T} pair
we performed 10 different random initializations, ending
up with 10 (possibly) different local maxima of F, from
which we chose the best one. The resulting trade-off
curves are presented in the right panel of Figure 4.

As before, as T is lowered, ⟨s⟩ and I(C;i) increase and
the solutions become more deterministic. For example,
for Nc = 20 and 1/T = 30, only ~32% of the movies
have a nearly deterministic assignment, while for 1/T = 40
almost all the movie assignments are nearly deterministic
[P(C|i) > 0.9 for a particular C].

For brevity, we focus our analysis on solutions for
which the saturation of ⟨s⟩ is relatively clear, i.e., on the
four solutions with Nc = {5, 10, 15, 20} and 1/T = 40,
and we treat these solutions as hard partitions where every
movie is assigned solely to its most probable cluster.



C. Comparing solutions at different numbers of clusters 

We examine directly how well our independent solutions
form a hierarchical structure by applying the same
scheme as in Section III.C. The results are presented in
Figure 14. Clearly, the relations between solutions at different
numbers of clusters are relatively weak, suggesting
that the data really do not support a robust hierarchical
structure. Only a few clusters are somewhat preserved as
we vary Nc, like the Family-Animation-Classic cluster,
c12, or the Action cluster, c9.

D. Coherence results 

1. Constructing the annotation matrices 

We used the genre labels provided for these data to
construct the annotation matrix. Specifically, these labels
are: Action (110 movies), Animation (25 movies),
Art-Foreign (45 movies), Classic (44 movies), Comedy
(149 movies), Drama (160 movies), Family (67 movies),
Horror (33 movies), Romance (61 movies), and Thriller
(90 movies). Almost half of the movies were annotated
with more than one genre and the average number of
genre annotations per movie was 1.6, with a maximal
number of 4 different genres for a single movie.

It is important to notice that these annotations are 
broad, providing a somewhat simplistic view of the struc- 
ture in these data. For example, it is quite reason- 
able that more subtle distinctions like the movie director 
and/or main actors are reflected in the viewer prefer- 
ences that were used to cluster the movies. Nonetheless, 
for practical reasons we used these broad labels as a first- 
order approximation for our evaluation. 

2. Coherence results and comparisons 

We estimated the statistical coherence of the clusters 
obtained at the low-temperature end of the trade-off 
curves where 1/T = 40. As before, to gain some per- 
spective, we also used the Cluster 3.0 software fl^. We 
experimented with the same 18 basic configurations as in
the previous applications (K-means variants, again with
100 different initializations), and applied the comparison
to all the different numbers of clusters we examined,
Nc = 5, 10, 15, 20. The results are summarized in
Tables XVI to XIX and in Figure 15.

In all cases, Iclust was clearly superior to the average
performance of the K-means and the Hierarchical
Cluster 3.0 variants. In fact, except for the hierarchical
complete-linkage configuration with the Pearson correlation
as the similarity measure, none of the other algorithms
came even close to the Iclust performance. Averaging
over all four Nc values, Iclust obtains an average coherence
of ~53% while the average K-means coherence
is only ~12% and the average Hierarchical coherence is
~24%.

Notice that, in contrast to the previous applications,
here the K-means algorithms are inferior to some of the
hierarchical algorithms (and both are inferior to Iclust).
These results demonstrate that while standard cluster- 
ing algorithms might work well in certain circumstances 
and fail completely in others, our principled and model- 
independent approach maintains a high and robust per- 
formance across a wide variety of applications. 
E. Detailed results for Nc = 20 clusters

In Table XX we present all enriched annotations for
the Iclust partition with Nc = 20 clusters and 1/T = 40.
Several results should be noted specifically.

For example, c12 consisted solely of 14 classic family
movies such as The Wizard of Oz and Snow White.
c8 consisted mainly of Art-Foreign movies, including the
entire Three Colors trilogy by Kieslowski. c15 included all
seven Star Trek movies. Moreover, some of the obtained
clusters reflect more subtle distinctions than the broad
genre definitions. For example, both c4 and c6 were enriched
for Comedy, but while c4 was further enriched for
Romance, c6 consisted mainly of Jim Carrey and Adam
Sandler movies. Both c7 and c17 were enriched for Action,
but c7 was further enriched for Classic with some
emphasis on Science Fiction movies such as the Star
Wars trilogy, the Terminator movies, Alien, and Back
to the Future. In contrast, c17 consisted mainly of movies
starring Sylvester Stallone, Jean-Claude Van Damme, etc.
Input:

Pairwise similarity matrix, s(i1, i2), ∀ i1 = 1, ..., N, i2 = 1, ..., N.
Trade-off parameter, T.
Requested number of clusters, Nc.
Convergence parameter, ε.

Output:

A (typically "soft") partition of the N elements into Nc clusters.

Initialization:

m = 0.

P^(m)(C|i) ← A random (normalized) distribution, ∀ i = 1, ..., N.

While True

For every i = 1, ..., N:

• P^(m+1)(C|i) ← P^(m)(C) exp{ (1/T) [2 s^(m)(C; i) − s^(m)(C)] }, ∀ C = 1, ..., Nc.

• P^(m+1)(C|i) ← P^(m+1)(C|i) / Σ_C' P^(m+1)(C'|i), ∀ C = 1, ..., Nc.

• m ← m + 1.

If ∀ i = 1, ..., N, ∀ C = 1, ..., Nc we have |P^(m+1)(C|i) − P^(m)(C|i)| < ε,
Break.



FIG. 1 Pseudo-code of the Iclust algorithm. Extending the algorithm for the general case (of more than pairwise relations)
is straightforward. In principle we repeat this procedure for different initializations and choose the solution which maximizes
F = ⟨s⟩ − T I(C; i).
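
The pseudo-code above can be turned into a small runnable sketch. The version below is an illustration under stated assumptions, not the authors' implementation: it assumes a uniform prior P(i) = 1/N, measures I(C; i) in nats, works in the log domain for numerical stability, and checks convergence once per sweep over all elements rather than after every single update. The function names (`cluster_stats`, `objective`, `iclust`) and the toy similarity matrix are hypothetical.

```python
import math
import random


def cluster_stats(p, s):
    """Return P(C), s(C;i) and s(C) for soft assignments p[i][c] = P(c|i)."""
    n, k = len(p), len(p[0])
    pc = [sum(p[i][c] for i in range(n)) / n for c in range(k)]
    s_ci = [[0.0] * n for _ in range(k)]
    s_c = [0.0] * k
    for c in range(k):
        if pc[c] == 0.0:
            continue
        pj = [p[j][c] / (n * pc[c]) for j in range(n)]  # P(j|C) by Bayes' rule
        for i in range(n):
            s_ci[c][i] = sum(pj[j] * s[i][j] for j in range(n))
        s_c[c] = sum(pj[i] * s_ci[c][i] for i in range(n))
    return pc, s_ci, s_c


def objective(p, s, T):
    """F = <s> - T * I(C;i), with the information measured in nats."""
    n, k = len(p), len(p[0])
    pc, _, s_c = cluster_stats(p, s)
    mean_s = sum(pc[c] * s_c[c] for c in range(k))
    info = sum(p[i][c] * math.log(p[i][c] / pc[c])
               for i in range(n) for c in range(k) if p[i][c] > 0.0) / n
    return mean_s - T * info


def iclust(s, n_clusters, T, eps=1e-6, max_sweeps=200, seed=0):
    """Sequential soft-assignment updates from a random initialization."""
    rng = random.Random(seed)
    n = len(s)
    p = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        p.append([v / total for v in row])
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(n):
            pc, s_ci, s_c = cluster_stats(p, s)
            # log of P(C) exp{(1/T)[2 s(C;i) - s(C)]}; empty clusters get -inf
            logits = [math.log(pc[c]) + (2.0 * s_ci[c][i] - s_c[c]) / T
                      if pc[c] > 0.0 else -math.inf
                      for c in range(n_clusters)]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            new_row = [w / total for w in weights]
            largest_change = max(largest_change,
                                 max(abs(a - b) for a, b in zip(new_row, p[i])))
            p[i] = new_row
        if largest_change < eps:
            break
    return p


# Toy similarity matrix with two blocks of three elements each.
blocks = [[1.0 if (a < 3) == (b < 3) else 0.0 for b in range(6)]
          for a in range(6)]
runs = [iclust(blocks, 2, T=1 / 20, seed=k) for k in range(5)]
best = max(runs, key=lambda p: objective(p, blocks, 1 / 20))
hard = [max(range(2), key=row.__getitem__) for row in best]
```

As in the caption, several random initializations are run and the solution with the largest F is kept. Recomputing the cluster statistics before every single-element update keeps the sketch close to the sequential pseudo-code, at a cost that is only acceptable for small N.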
FIG. 2 Expression profiles of the 868 genes in the ESR data across 173 microarray experiments. Data taken from Gasch et al.
1^3) • Missing values are set to zero. The genes are sorted according to the clustering partition into 20 clusters that we analyze 
in detail later on. Inside each cluster, genes are sorted according to the average mutual information relation with other cluster 
members. 
FIG. 3 Pairwise mutual information relations for the 868 genes in the ESR data. The genes are sorted according to the
clustering partition into 20 clusters that we analyze in detail later on. Inside each cluster, genes are sorted according to the 
average mutual information relation with other cluster members. 
FIG. 4 (Left) Tradeoff curves obtained for the ESR data. Each curve describes the solutions obtained for a particular Nc
value, i.e., for a fixed number of clusters. Different points along each curve correspond to different local maxima of F at
different T values. The results are presented for 1/T = {5, 10, 15, 20, 25}, which suffices to obtain a relatively clear saturation
of the average mutual information per cluster, ⟨s⟩. In Section III.C we explore the possible hierarchical relations between the
four saturated solutions at 1/T = 25 and Nc = {5, 10, 15, 20}. Further detailed analysis refers to the solution with Nc = 20 and
1/T = 25 that obtained the highest ⟨s⟩ value. (Middle) Similar tradeoff curves that were obtained for the SP500 data. The
results are presented for 1/T = {15, 20, 25, 30, 35}, which suffices to obtain a relatively clear saturation of ⟨s⟩. Notice that, due to
the lower average mutual information relations in these data (with respect to the ESR example), one must apply lower T values
to obtain a clear saturation. In Section IV.C we explore the possible hierarchical relations between the four saturated solutions
at 1/T = 35 and Nc = {5, 10, 15, 20}. Further detailed analysis refers to the solution with Nc = 20 and 1/T = 35. (Right)
Similar tradeoff curves that were obtained for the EachMovie data. The results are presented for 1/T = {20, 25, 30, 35, 40}, which
suffices to obtain a relatively clear saturation of ⟨s⟩. In Section V.C we explore the possible hierarchical relations between the
four saturated solutions at 1/T = 40 and Nc = {5, 10, 15, 20}. Further detailed analysis refers to the solution with Nc = 20 and
1/T = 40.
FIG. 5 Relations between the optimal solutions with Nc = {5, 10, 15, 20} at 1/T = 25 for the ESR data. At the upper level,
Nc = 20 clusters, and the clusters are sorted as in Figure 2 and Figure 3. The numbers above every cluster indicate the number
of genes in this cluster. The title of each cluster corresponds to the most enriched GObp (biological process) annotation in the
cluster, i.e., to the GObp annotation with the smallest P-value in the cluster (see Section III.D.1). The only exceptions are c6,
not enriched in GObp, and c19, enriched with a non-informative annotation (response to stress). For these two clusters we use
their most enriched GOmf (molecular function) annotation as a title. The titles of the five clusters at the lower level (Nc = 5)
are given by their most enriched GOcc (cellular component) annotation. Notice that most clusters were enriched with more than
one annotation, hence the short titles might be too concise in some cases (see Section III.E for a detailed description of every
cluster at the top level). Red and green clusters represent clusters with a clear majority of stress-induced or stress-repressed
genes, respectively. In the cytoplasm cluster we had a relatively balanced mixture of stress-repressed (58%) and stress-induced
(42%) genes.
FIG. 6 ESR data: Comparison of average coherence results of the Iclust algorithm (yellow) with conventional clustering
algorithms ITH l: K-means (green); K-medians (blue); Hierarchical (red). For the hierarchical algorithms, four different
variants are tried: complete, average, centroid, and single linkage, respectively from left to right. For every algorithm, three 
different similarity measures are applied: Pearson correlation (left); absolute value of Pearson correlation (middle); Euclidean 
distance (right). The white bars correspond to applying the algorithm to the logarithmically transformed expression ratios. In 
all cases, the results are averaged over all the different numbers of clusters that we tried: Nc = 5, 10, 15, 20, and over the three 
Gene Ontologies. 
FIG. 7 Examples of the average behavior of some of the clusters obtained with Nc = 20. Error bars indicate standard deviation.
The vertical axis measures the log2 of the expression ratio. The dashed ("Other genes") curve displays the average behavior of the
repressed genes, excluding those in the clusters that are mentioned in the figure. In panel A the upper dashed curve corresponds
to the average behavior of the induced genes, excluding those in c5. (A) c5 in Menadione exposure and stationary phase. (B)
c8 and c19 in different temperature shifts. (C) c16 and c18 in Diamide treatment. (D) c7 and c2 in Nitrogen depletion. (E)
c17 and c15 in steady-state growth. (F) c17 and c15 in hyper-osmotic shock.
FIG. 8 Fractional changes in stock price of the Standard and Poor's companies we considered during the 273 trading days
of December 2002 - December 2003. The companies are sorted according to the clustering partition into 20 clusters that we 
analyze in detail later on. Inside each cluster, companies are sorted according to the average mutual information relation with 
other cluster members. 




FIG. 9 Pairwise mutual information relations for the SP500 data. The companies are sorted according to the clustering 
partition into 20 clusters that we analyze in detail later on. Inside each cluster, companies are sorted according to the average 
mutual information relation with other cluster members. 
FIG. 10 Relations between the optimal solutions with Nc = {5, 10, 15, 20} at 1/T = 35 for the SP500 data. At the upper level,
Nc = 20 clusters, and the clusters are sorted as in Figure 8 and Figure 9. The numbers above every cluster indicate the number
of companies in this cluster. The title of each cluster corresponds to the most enriched annotation in the cluster, i.e., to the
annotation with the smallest P-value in the cluster. Text boxes of similar color indicate that the corresponding annotations
belong to the same major sector of the economy (see Section IV.D.1). Notice that most clusters were enriched with more than
one annotation, hence the short titles might be too concise in some cases (see Section IV.E for a detailed description of every
cluster).
FIG. 11 SP500 data: Comparison of average coherence results of the Iclust algorithm (yellow) with conventional clustering
algorithms ijliift : K-means (green); K-medians (blue); Hierarchical (red). For the hierarchical algorithms, four different variants
are tried: complete, average, centroid, and single linkage, respectively from left to right. For every algorithm, three different 
similarity measures are applied: Pearson correlation (left); absolute value of Pearson correlation (middle); Euclidean distance 
(right). In all cases, the results are averaged over all the different numbers of clusters that we tried: Nc = 5, 10, 15, 20. 
FIG. 12 Discrete movie ratings for the 500 movies with the maximal number of votes in the EachMovie data. The ratings are
presented only for the 1000 viewers who rated the maximal number of movies. Zeros represent missing values (i.e., no vote). 
The movies are sorted according to the clustering partition into 20 clusters that we analyze in detail later on. Inside each 
cluster, movies are sorted according to the average mutual information relation with other cluster members. 




FIG. 13 Pairwise mutual information relations for the EachMovie data. The movies are sorted according to the clustering 
partition into 20 clusters that we analyze in detail later on. Inside each cluster, movies are sorted according to the average 
mutual information relation with other cluster members. 
FIG. 14 Relations between the optimal solutions with Nc = {5, 10, 15, 20} at 1/T = 40 for the EachMovie data. At the upper
level, Nc = 20 clusters, and the clusters are sorted as in Figure 12 and Figure 13. The numbers above every cluster indicate
the number of movies in this cluster. The title of each cluster corresponds to (all) enriched genre annotations in the cluster,
i.e., to all annotations with a (Bonferroni corrected) P-value below 0.05. See Section V.E for a detailed description of every
cluster.
FIG. 15 EachMovie data: Comparison of average coherence results of the Iclust algorithm (yellow) with conventional
clustering algorithms ijliil) : K-means (green); K-medians (blue); Hierarchical (red). For the hierarchical algorithms, four
different variants are tried: complete, average, centroid, and single linkage, respectively from left to right. For every algorithm, 
three different similarity measures are applied: Pearson correlation (left); absolute value of Pearson correlation (middle); 
Euclidean distance (right). In all cases, the results are averaged over all the different numbers of clusters that we tried: 
Nc = 5, 10, 15, 20. 
TABLE I A simple example of an annotation matrix. Here, the total number of elements is N = 5 and the total number of
distinct annotations is R = 4. The first element is assigned the second and third annotations, and so on.

Element index   a1   a2   a3   a4
element1        0    1    1    0
element2        1    0    1    1
element3        1    0    0    0
element4        0    1    1    1
element5        1    0    1    1

TABLE II Examples of P-values. When the annotation is over-abundant in the cluster (with respect to its frequency in the
entire population) the P-value is reduced accordingly.

N                   K                n                x                           Pval
(Population size)   (Annot. freq.)   (Cluster size)   (Annot. freq. in cluster)
1000                100              50               5                           0.57
1000                100              50               20                          10"'
1000                20               100              2                           0.61
1000                20               100              20                          10^-21

TABLE III A small subset of the A_BP annotation matrix, constructed for the ESR data out of the GObp ontology.

ORF       Metabolism   Transcription   RNA processing   Ribosome biogenesis
YKL144C   1            1               0                0
YML060W   1            0               0                0
YGR251W   1            1               1                1
YLL036C   1            0               1                0
YNL163C   0            0               0                1
TABLE IV An example of a subset of genes from a single cluster that are assigned different specific GObp terms. The functional
relationship between these genes becomes statistically significant only if one considers the fact that all these annotations
have a common ancestor in the GObp database, the tRNA aminoacylation for protein translation term.

ORF       Direct GObp annotation
YDR037W   lysyl-tRNA aminoacylation
YGR094W   valyl-tRNA aminoacylation
YLR060W   phenylalanyl-tRNA aminoacylation
YNL247W   cysteinyl-tRNA aminoacylation
YPL160W   leucyl-tRNA aminoacylation

TABLE V Details of the different annotation matrices used for evaluating the statistical significance of the obtained clusters for
the yeast ESR genes. ^a Data source for constructing the annotation matrix. ^b Number of distinct annotations in the annotation
matrix that are assigned to at least two genes and thus participate in the analysis. ^c Number of genes assigned at least one annotation
and thus participating in the analysis. Notice that this number determines the population size (N) for the P-value estimation.
^d Average number of distinct annotations per gene. ^e Maximal number of distinct annotations for a single gene.

Data source^a   # Annotations^b   # Genes^c   Avg. # Annot. per gene^d   Maximal # Annot. per gene^e
GObp (13)       472               614         11.4                       63
GOmf (13)       215               561         4.6                        18
GOcc            94                747         5.4                        14
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TABLE VI Coherence results for the ESR data with respect to the three Gene Ontologies with Nc = 20 clusters. "Clustering 
algorithm. In the ( Jf-means ) row we present the average results of all the six /f-means variants. For each of these variants 
we performed 100 runs from which the best solution is chosen. In the ( Hier. } row we present the average results of all the 

12 Hierarchical clustering variants. In parenthesis we present the results where the input are the log2 of the expression ratio 
profiles. ''Correlation measure used by the algorithm. PC stands for the (centered) Pearson Correlation. \PC\ is the absolute 
value of this correlation. Euclidean stands for the Euclidean distance. '^Number of clusters with a positive coherence with 
respect to the GO bp ontology. ''Average coherence of all 20 clusters with respect to the GO bp ontology. '^Number of clusters 
with a positive coherence with respect to the GOmf ontology. ■'^ Average coherence of all 20 clusters with respect to the GOmf 
ontology. ''Number of clusters with a positive coherence with respect to the GOcc ontology. ''Average coherence of all 20 
clusters with respect to the GOcc ontology. 



Nc = 20                                     BP           BP           MF           MF           CC           CC
Algorithm^a            Similarity^b         N_pos^c      ⟨Coh⟩^d      N_pos^e      ⟨Coh⟩^f      N_pos^g      ⟨Coh⟩^h
Iclust                 mutual information   17           51           16           41           14           33
K-means                PC                   11 (13)      30 (43)      11 (11)      31 (31)      10 (12)      19 (30)
K-means                |PC|                 9 (15)       27 (50)      8 (14)       24 (40)      8 (16)       26 (42)
K-means                Euclidean            7 (15)       23 (52)      9 (15)       26 (39)      5 (16)       13 (51)
K-medians              PC                   11 (15)      35 (51)      13 (16)      34 (48)      10 (15)      35 (46)
K-medians              |PC|                 12 (15)      38 (41)      16 (16)      43 (39)      13 (11)      37 (35)
K-medians              Euclidean            16 (18)      49 (52)      15 (14)      39 (44)      13 (16)      43 (51)
⟨K-means⟩                                   11.0 (15.2)  33.7 (48.2)  12.0 (14.3)  32.8 (40.2)  9.8 (14.3)   28.8 (42.5)
Hier - Comp. linkage   PC                   9 (13)       29 (41)      10 (10)      25 (30)      7 (12)       19 (34)
Hier - Comp. linkage   |PC|                 9 (10)       25 (26)      12 (9)       31 (27)      7 (10)       17 (26)
Hier - Comp. linkage   Euclidean            1 (13)       2 (43)       3 (11)       8 (32)       1 (8)        2 (27)
Hier - Avg. linkage    PC                   5 (7)        17 (20)      5 (5)        18 (17)      4 (4)        11 (12)
Hier - Avg. linkage    |PC|                 5 (4)        17 (10)      5 (2)        18 (8)       4 (2)        10 (4)
Hier - Avg. linkage    Euclidean            1 (9)        2 (29)       1 (4)        1 (17)       2 (6)        6 (16)
Hier - Centr. linkage  PC                   4 (3)        12 (10)      4 (3)        12 (10)      4 (2)        11 (8)
Hier - Centr. linkage  |PC|                 4 (4)        12 (12)      3 (4)        7 (11)       4 (2)        9 (4)
Hier - Centr. linkage  Euclidean            0 (4)        0 (13)       0 (4)        0 (12)       1 (2)        1 (8)
Hier - Sing. linkage   PC                   2 (2)        8 (8)        2 (2)        7 (7)        2 (2)        8 (8)
Hier - Sing. linkage   |PC|                 2 (0)        6 (0)        1 (0)        5 (0)        0 (0)        0 (0)
Hier - Sing. linkage   Euclidean            0 (0)        0 (0)        0 (0)        0 (0)        0 (0)        0 (0)
⟨Hier.⟩                                     3.5 (5.8)    10.8 (17.7)  3.8 (4.5)    11.0 (14.2)  3.0 (4.2)    7.8 (12.2)
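
As a small consistency check, the averaged rows can be reproduced from the per-variant rows. The sketch below uses the BP N_pos column of the six K-means and K-medians variants at Nc = 20; the helper name and the rounding to one decimal place are assumptions made for illustration.

```python
def average_row(variant_values):
    """Mean over algorithm variants, rounded to one decimal place."""
    return round(sum(variant_values) / len(variant_values), 1)


# BP N_pos of the six K-means / K-medians variants at Nc = 20 (raw expression input).
bp_npos_raw = [11, 9, 7, 11, 12, 16]
# The same column for the log2-transformed input (the values in parentheses).
bp_npos_log2 = [13, 15, 15, 15, 15, 18]

print(average_row(bp_npos_raw), average_row(bp_npos_log2))
```

The two means agree with the 11.0 (15.2) entry reported in the averaged K-means row.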
TABLE VII Coherence results for the ESR data with respect to the three Gene Ontologies with Nc = 15 clusters. The column and row definitions are as in Table VI. Again, in parentheses we present the results where the input is the log2 of the expression ratio profiles.



Nc = 15                                     BP           BP           MF           MF           CC           CC
Algorithm              Similarity           N_pos        ⟨Coh⟩        N_pos        ⟨Coh⟩        N_pos        ⟨Coh⟩
Iclust                 mutual information   ?            ?            ?            ?            14           ?
K-means                PC                   7 (14)       29 (55)      8 (13)       32 (49)      7 (10)       18 (38)
K-means                |PC|                 10 (14)      40 (47)      9 (11)       32 (37)      8 (12)       27 (38)
K-means                Euclidean            10 (12)      33 (50)      8 (13)       36 (46)      3 (11)       14 (44)
K-medians              PC                   11 (13)      40 (46)      11 (13)      41 (49)      10 (14)      41 (47)
K-medians              |PC|                 11 (14)      42 (50)      11 (13)      31 (44)      10 (11)      35 (38)
K-medians              Euclidean            11 (14)      50 (58)      12 (13)      42 (43)      11 (13)      46 (61)
⟨K-means⟩                                   10.0 (13.5)  39.0 (51.0)  9.8 (12.7)   35.7 (44.7)  8.2 (11.8)   30.2 (44.3)
Hier - Comp. linkage   PC                   8 (11)       32 (43)      9 (8)        31 (31)      6 (9)        20 (44)
Hier - Comp. linkage   |PC|                 4 (8)        17 (29)      7 (7)        21 (29)      5 (8)        15 (32)
Hier - Comp. linkage   Euclidean            0 (8)        0 (33)       1 (8)        2 (29)       1 (6)        2 (27)
Hier - Avg. linkage    PC                   5 (5)        21 (22)      4 (5)        18 (21)      4 (3)        13 (12)
Hier - Avg. linkage    |PC|                 4 (4)        15 (13)      3 (3)        11 (12)      3 (2)        5 (5)
Hier - Avg. linkage    Euclidean            2 (7)        ? (?)        1 (4)        1 (22)       1 (4)        8 (14)
Hier - Centr. linkage  PC                   4 (3)        16 (13)      4 (3)        16 (15)      4 (3)        14 (12)
Hier - Centr. linkage  |PC|                 4 (3)        16 (11)      3 (3)        7 (11)       4 (3)        11 (6)
Hier - Centr. linkage  Euclidean            0 (3)        0 (15)       0 (3)        0 (11)       0 (2)        0 (11)
Hier - Sing. linkage   PC                   2 (2)        11 (11)      2 (2)        9 (9)        2 (2)        11 (11)
Hier - Sing. linkage   |PC|                 0 (0)        0 (0)        0 (0)        0 (0)        0 (0)        0 (0)
Hier - Sing. linkage   Euclidean            0 (0)        0 (0)        0 (0)        0 (0)        1 (0)        6 (0)
⟨Hier.⟩                                     2.8 (4.5)    11.3 (18.8)  2.8 (3.8)    9.7 (15.8)   2.7 (3.5)    8.8 (14.5)
TABLE VIII Coherence results for the ESR data with respect to the three Gene Ontologies with Nc = 10 clusters. The column and row definitions are as in Table VI. Again, in parentheses we present the results where the input is the log2 of the expression ratio profiles.



Nc = 10                                     BP           BP           MF           MF           CC           CC
Algorithm              Similarity           N_pos        ⟨Coh⟩        N_pos        ⟨Coh⟩        N_pos        ⟨Coh⟩
Iclust                 mutual information   7            ?            7            ?            ?            ?
K-means                PC                   8 (9)        45 (52)      8 (8)        44 (47)      7 (9)        37 (56)
K-means                |PC|                 7 (9)        41 (48)      6 (8)        42 (41)      8 (8)        39 (48)
K-means                Euclidean            5 (10)       27 (62)      6 (10)       30 (57)      3 (8)        22 (55)
K-medians              PC                   9 (10)       51 (57)      8 (9)        45 (53)      9 (10)       49 (54)
K-medians              |PC|                 7 (9)        45 (52)      8 (9)        50 (47)      9 (8)        41 (56)
K-medians              Euclidean            7 (9)        48 (62)      8 (9)        46 (60)      7 (9)        49 (58)
⟨K-means⟩                                   7.2 (9.3)    42.8 (55.5)  7.3 (8.8)    42.8 (50.8)  7.2 (8.7)    39.5 (54.5)
Hier - Comp. linkage   PC                   6 (8)        33 (44)      7 (5)        43 (32)      5 (7)        26 (43)
Hier - Comp. linkage   |PC|                 4 (6)        24 (33)      6 (5)        32 (37)      5 (6)        22 (30)
Hier - Comp. linkage   Euclidean            2 (7)        12 (41)      2 (7)        8 (39)       2 (5)        7 (32)
Hier - Avg. linkage    PC                   3 (4)        19 (30)      3 (4)        20 (30)      3 (3)        18 (19)
Hier - Avg. linkage    |PC|                 2 (4)        8 (20)       1 (3)        7 (19)       1 (2)        1 (7)
Hier - Avg. linkage    Euclidean            0 (4)        0 (33)       0 (5)        0 (28)       0 (4)        0 (20)
Hier - Centr. linkage  PC                   3 (3)        19 (19)      3 (3)        21 (20)      3 (3)        18 (18)
Hier - Centr. linkage  |PC|                 4 (3)        19 (21)      3 (3)        11 (17)      4 (3)        16 (9)
Hier - Centr. linkage  Euclidean            0 (3)        0 (21)       1 (3)        2 (17)       0 (3)        0 (9)
Hier - Sing. linkage   PC                   2 (2)        16 (16)      2 (2)        13 (14)      2 (2)        17 (17)
Hier - Sing. linkage   |PC|                 0 (0)        0 (0)        0 (0)        0 (0)        0 (0)        0 (0)
Hier - Sing. linkage   Euclidean            0 (0)        0 (0)        0 (0)        0 (0)        0 (0)        0 (0)
⟨Hier.⟩                                     2.2 (3.7)    12.5 (23.2)  2.3 (3.3)    13.1 (21.1)  2.1 (3.2)    10.4 (17.0)
TABLE IX Coherence results for the ESR data with respect to the three Gene Ontologies with Nc = 5 clusters. The column and row definitions are as in Table VI. Again, in parentheses we present the results where the input is the log2 of the expression ratio profiles.



Nc = 5                                      BP           BP           MF           MF           CC           CC
Algorithm              Similarity           N_pos        ⟨Coh⟩        N_pos        ⟨Coh⟩        N_pos        ⟨Coh⟩
Iclust                 mutual information   ?            ?            ?            ?            ?            ?
K-means                PC                   ? (5)        62 (65)      5 (5)        63 (65)      5 (5)        75 (73)
K-means                |PC|                 5 (5)        61 (70)      5 (5)        62 (67)      5 (5)        75 (70)
K-means                Euclidean            ? (5)        43 (71)      3 (5)        35 (56)      3 (4)        39 (65)
K-medians              PC                   5 (5)        64 (62)      5 (5)        65 (63)      5 (5)        72 (72)
K-medians              |PC|                 5 (5)        57 (59)      5 (5)        52 (58)      5 (5)        75 (69)
K-medians              Euclidean            4 (5)        52 (71)      4 (5)        59 (60)      4 (4)        68 (57)
⟨K-means⟩                                   4.5 (5.0)    56.5 (66.3)  4.5 (5.0)    56.0 (61.5)  4.5 (4.7)    67.3 (67.7)
Hier - Comp. linkage   PC                   4 (4)        42 (44)      5 (4)        52 (46)      4 (4)        37 (57)
Hier - Comp. linkage   |PC|                 4 (4)        47 (51)      5 (3)        34 (44)      3 (4)        30 (45)
Hier - Comp. linkage   Euclidean            1 (3)        11 (37)      2 (4)        13 (49)      0 (4)        0 (36)
Hier - Avg. linkage    PC                   3 (3)        38 (39)      3 (3)        40 (47)      3 (3)        36 (37)
Hier - Avg. linkage    |PC|                 0 (1)        0 (6)        0 (1)        0 (13)       0 (2)        0 (8)
Hier - Avg. linkage    Euclidean            0 (2)        0 (31)       0 (?)        0 (30)       0 (2)        0 (33)
Hier - Centr. linkage  PC                   3 (2)        39 (32)      3 (2)        41 (27)      3 (2)        36 (33)
Hier - Centr. linkage  |PC|                 3 (1)        21 (8)       2 (0)        19 (0)       3 (1)        13 (6)
Hier - Centr. linkage  Euclidean            0 (1)        0 (8)        0 (0)        0 (0)        0 (1)        0 (6)
Hier - Sing. linkage   PC                   2 (2)        32 (32)      2 (2)        27 (27)      2 (2)        33 (33)
Hier - Sing. linkage   |PC|                 0 (0)        0 (0)        0 (0)        0 (0)        0 (0)        0 (0)
Hier - Sing. linkage   Euclidean            0 (0)        0 (0)        0 (0)        0 (0)        0 (0)        0 (0)
⟨Hier.⟩                                     1.7 (1.9)    19.2 (24.0)  1.8 (1.8)    18.8 (23.6)  1.5 (2.1)    15.4 (24.5)
TABLE X Enriched GO annotations in the Iclust solution with Nc = 20 clusters. Clusters are ordered as in the corresponding figures. Only annotations with a P-value below 0.05 (Bonferroni corrected) are presented. ^a Cluster index, number of repressed genes in the cluster (Rep), and number of induced genes in the cluster (Ind). ^b Cluster coherence (in percentage) in each of the three Gene Ontologies. ^c Enriched annotations for the GObp ontology. In parentheses, (x/K,p) stands for the number of genes in the cluster to which this annotation is assigned, the number of genes in the ESR module to which this annotation is assigned, and the Bonferroni corrected P-value, respectively. ^d Enriched annotations for the GOmf ontology. Parentheses are as in the previous column. ^e Enriched annotations for the GOcc ontology. Parentheses are as in the previous column.

Cluster^a   Coh.^b   BP enriched annot.^c   MF enriched annot.^d   CC enriched annot.^e

c12   Rep: 57   Ind: 3   Coh.: BP 18, MF 9, CC 0
  BP: nucleoside monophosphate metabolism (5/6,0.0032)
      purine nucleoside monophosphate meta (5/6,0.0032)
      ribonucleoside monophosphate metabol (5/6,0.0032)
      purine ribonucleoside monophosphate (5/6,0.0032)
      histidine biosynthesis (4/4,0.0076)
      histidine metabolism (4/4,0.0076)
      histidine family amino acid metaboli (4/4,0.0076)
      histidine family amino acid biosynth (4/4,0.0076)
      nucleoside monophosphate biosynthesi (4/4,0.0076)
      purine nucleoside monophosphate bios (4/4,0.0076)
      ribonucleoside monophosphate biosynt (4/4,0.0076)
      purine ribonucleoside monophosphate (4/4,0.0076)
      biogenic amine biosynthesis (4/4,0.0076)
      amine biosynthesis (7/16,0.0187)
      amino acid derivative biosynthesis (4/5,0.0359)
  MF: diphosphotransferase activity (4/4,0.0047)
  CC: -

c15   Rep: ?   Ind: 0   Coh.: BP 67, MF 75, CC 0
  BP: tRNA aminoacylation for protein tran (8/14,0.00)
  MF: tRNA ligase activity (9/17,0.00)
      RNA ligase activity (9/17,0.00)
      ligase activity, forming carbon-oxyg (9/17,0.00)
      ligase activity, forming aminoacyl-t (9/17,0.00)
      ligase activity, forming phosphoric (9/17,0.00)
      ligase activity (9/23,0.00)
  CC: -

c14   Rep: 15   Ind: ?   Coh.: BP 57, MF 43, CC 53
  BP: vacuolar acidification (3/3,0.0010)
      regulation of pH (3/4,0.0038)
      monovalent inorganic cation homeosta (3/4,0.0038)
      hydrogen ion homeostasis (3/4,0.0038)
      organelle organization and biogenesi (6/38,0.0080)
      vacuolar transport (3/5,0.0093)
      cation homeostasis (?)
      asparagine biosynthesis (2/2,0.0488)
      aspartate family amino acid biosynth (2/2,0.0488)
  MF: hydrogen-transporting ATPase activit (3/3,0.0005)
      ion transporter activity (4/8,0.0006)
      monovalent inorganic cation transpor (3/4,0.0020)
      hydrogen ion transporter activity (3/4,0.0020)
      ATPase activity, coupled to transmem (?)
      cation transporter activity (3/6,0.0095)
      asparagine synthase (glutamine-hydro) (2/2,0.0232)
      primary active transporter activity (3/9,0.0382)
      P-P-bond-hydrolysis-driven transport (3/9,0.0382)
      hydrolase activity, acting on acid a (3/9,0.0382)
      ATPase activity, coupled to transmem (3/9,0.0382)
  CC: chaper.-contain. T-compl. (3/3,0.0001)
      hydrog.-transloc. ATPase (3/3,0.0001)
      vacuolar membrane (3/8,0.0073)
      cytoskeleton (3/8,0.0073)
      hydrog.-transp. ATPase V0 (?)
      membrane (5/47,0.0312)

c10   Rep: 35   Ind: 1   Coh.: BP 0, MF 0, CC 0
  BP: -
  MF: -
  CC: -

c2    Rep: 17   Ind: 0   Coh.: BP 100, MF 80, CC 100
  BP: protein biosynthesis (15/189,0.00)
      macromolecule biosynthesis (15/189,0.00)
      biosynthesis (15/236,0.00)
      translational elongation (5/10,0.00)
      protein metabolism (15/252,0.00)
  MF: structural constituent of ribosome (12/127,0.0001)
      structural molecule activity (12/128,0.0001)
  CC: ribosome (14/153,0.00)
      ribonucleopro. complex (14/186,0.00)
      cytos. large ribos. subunit (10/75,0.00)
      large ribosomal subunit (10/75,0.00)
      cytos. ribos. (sens. Eukar.) (12/132,0.00)
      cytosol (12/165,0.0001)
      cytoplasm (16/525,0.0430)

c3    Rep: 29   Ind: 6   Coh.: BP 17, MF 15, CC 23
  BP: transcription from Pol II promoter (5/11,0.0077)
  MF: ribonuclease activity (4/10,0.039)
  CC: DNA-direct. RNA polym. II-core (4/7,0.004)
      DNA-direct. RNA polym. II-holo (4/9,0.012)
      cytoplas. exosom. (RNase compl.) (3/5,0.028)

c20   Rep: 0   Ind: 38   Coh.: BP 25, MF 0, CC 22
  BP: cell communication (5/18,0.0147)
      signal transduction (4/12,0.0359)
  MF: -
  CC: vacuole (7/26,0.0014)
      storage vacuole (5/20,0.0288)
      lytic vacuole (5/20,0.0288)
      vacuole (sensu Fungi) (5/20,0.0288)


c6    Rep: 3   Ind: 36   Coh.: BP 0, MF 33, CC 7
  BP: -
  MF: oxidoreductase activity (6/37,0.0255)
  CC: peroxisomal matrix (2/2,0.0353)

c8    Rep: 3   Ind: 22   Coh.: BP 0, MF 0, CC 0
  BP: -
  MF: -
  CC: -

c11   Rep: 0   Ind: 34   Coh.: BP 63, MF 58, CC 100
  BP: carbohydrate biosynthesis (5/8,0.00)
      gluconeogenesis (4/5,0.000)
      hexose biosynthesis (4/5,0.0002)
      alcohol biosynthesis (4/5,0.0002)
      monosaccharide biosynthesis (4/5,0.0002)
      regulation of carbohydrate metabolis (3/3,0.0015)
      regulation of gluconeogenesis (3/3,0.0015)
      regulation of biosynthesis (3/3,0.0015)
      negative regulation of biosynthesis (3/3,0.0015)
      negative regulation of metabolism (3/3,0.0015)
      negative regulation of gluconeogenes (3/3,0.0015)
      negative regulation of carbohydrate (3/3,0.0015)
      energy pathways (6/25,0.0015)
      energy derivation by oxidation of or (6/25,0.0015)
      protein amino acid phosphorylation (4/8,0.0021)
      phosphorylation (4/9,0.0037)
      main pathways of carbohydrate metabo (4/11,0.0093)
      phosphorus metabolism (4/11,0.0093)
      phosphate metabolism (4/11,0.0093)
      carbohydrate metabolism (6/35,0.0120)
      glucose metabolism (4/12,0.0138)
      regulation of metabolism (3/6,0.0284)
      hexose metabolism (4/15,0.0363)
  MF: protein kinase activity (5/12,0.00)
      phosphotransfer. activ., alcohol (5/19,0.000)
      kinase activity (5/27,0.0027)
      protein serine/threonine kinase acti (3/6,0.0035)
      cyclic-nucleotide dependent protein (2/2,0.0101)
      cAMP-depend. prot. kinase activi (2/2,0.01)
      transferase activity (7/103,0.0496)
  CC: lipid particle (3/5,0.0029)
      cAMP-depend. prot. kinase compl. (2/2,0.012)
      cytoplasm (20/525,0.0139)

c13   Rep: 16   Ind: 0   Coh.: BP 40, MF 20, CC 0
  BP: pyruvate dehydrogenase bypass (3/3,0.0002)
      pyruvate metabolism (3/3,0.0002)
      fermentation (2/2,0.0189)
      ethanol fermentation (2/2,0.0189)
      glycolytic fermentation (2/2,0.0189)
      alcohol metabolism (4/25,0.0303)
  MF: pyruvate decarboxylase activity (2/3,0.0264)
  CC: -

c1    Rep: 13   Ind: 0   Coh.: BP 73, MF 36, CC 20
  BP: translational elongation (4/10,0.0004)
      nascent polypeptide association (2/2,0.0096)
      methionine metabolism (2/2,0.0096)
      serine family amino acid metabolism (2/3,0.0286)
  MF: transl. elong. fact. activi (3/7,0.003)
      translation factor activity, nucleic (4/29,0.0253)
      translation regulator activity (4/31,0.0328)
  CC: nascent polypept.-associat. compl. (2/2,0.002)

c17   Rep: 7   Ind: 3   Coh.: BP 80, MF 33, CC 25
  BP: S phase of mitotic cell cycle (4/10,0.00)
      DNA replication (4/10,0.00)
      DNA replication and chromosome cycle (4/14,0.00)
      mitotic cell cycle (4/14,0.00)
      cell cycle (4/22,0.0002)
      DNA metabolism (4/24,0.0003)
      DNA dependent DNA replication (3/8,0.0004)
      cell proliferation (4/27,0.0004)
      DNA replication, priming (2/2,0.0016)
      DNA replication initiation (2/3,0.0048)
      lagging strand elongation (2/3,0.0048)
      DNA strand elongation (2/4,0.0095)
      DNA repair (2/8,0.0438)
  MF: alpha DNA polymer. activ. (2/2,0.003)
      DNA-direct. DNA polymer. activ. (2/4,0.015)
  CC: alpha DNA polymer.:primase complex (2/2,0.0010)
      replication fork (2/4,0.0060)

c5    Rep: 6   Ind: 26   Coh.: BP 24, MF 45, CC 11
  BP: regulation of redox homeostasis (3/5,0.0358)
      cell redox homeostasis (3/5,0.0358)
      oxygen and reactive oxygen species m (4/12,0.0455)
  MF: oxidoreductase activity (9/37,0.00)
  CC: mitochondr. intermembr. space (3/6,0.017)
c4    Rep: 1   Ind: 46   Coh.: BP 62, MF 89, CC 7
  BP: carbohydrate metabolism (11/35,0.00)
      response to stress (10/50,0.0050)
  MF: catalytic activity (24/280,0.0008)
      hydrolase activity, acting on carbon (3/4,0.0201)
  CC: alpha,alpha-trehalose-phosphate synt (3/3,0.005)

c19   Rep: 0   Ind: 49   Coh.: BP 45, MF 48, CC 0
  BP: response to stress (14/50,0.00)
  MF: heat shock protein activity (6/8,0.00)
      oxidoreductase activity (9/37,0.0031)
      chaperone activity (6/19,0.0139)
  CC: -

c7    Rep: 114   Ind: 9   Coh.: BP 89, MF 0, CC 78
  BP: nucleobase, nucleoside, nucleotide a (46/188,0.00)
      transcription, DNA-dependent (35/117,0.00)
      transcription (35/119,0.00)
      RNA metabolism (33/119,0.00)
      RNA processing (31/110,0.00)
      ribosome biogenesis (30/110,0.0001)
      transcription from Pol I promoter (28/102,0.0002)
      rRNA processing (25/90,0.0008)
      ribosome biogenesis and assembly (32/136,0.0013)
      cellular process (48/263,0.0037)
      cell growth and/or maintenance (47/256,0.0044)
      cytoplasm organization and biogenesi (35/172,0.0142)
      transcription from Pol III promoter (7/13,0.0392)
      cell organization and biogenesis (37/196,0.0479)
  MF: -
  CC: nucleus (83/288,0.00)
      nucleolus (40/118,0.00)
      nucleoplasm (13/34,0.0125)
      DNA-direct. RNA polymer. III comp (7/13,0.03)

c16   Rep: 87   Ind: 1   Coh.: BP 96, MF 77, CC 88
  BP: ribosome biogenesis (48/110,0.00)
      ribosome biogenesis and assembly (51/136,0.00)
      RNA processing (43/110,0.00)
      RNA metabolism (44/119,0.00)
      cytoplasm organization and biogenesi (52/172,0.00)
      transcription from Pol I promoter (40/102,0.00)
      rRNA processing (37/90,0.00)
      transcription, DNA-dependent (41/117,0.00)
      transcription (41/119,0.00)
      nucleobase, nucleoside, nucleotide a (51/188,0.00)
      cell organization and biogenesis (52/196,0.00)
      cellular process (55/263,0.00)
      cell growth and/or maintenance (54/256,0.00)
      processing of 20S pre-rRNA (16/37,0.0001)
      ribosomal large subunit biogenesis (9/13,0.0002)
      ribosomal large subunit assembly and (10/26,0.0380)
  MF: snoRNA binding (14/25,0.00)
      RNA binding (21/70,0.00)
      methyltransfer. activ. (10/20,0.000)
      transferase activ., transferr. o (10/20,0.000)
      RNA helicase activity (9/17,0.0004)
      ATP depend. RNA helic. activ. (8/14,0.001)
      ATP depend. helic. activ. (8/14,0.001)
      RNA depend. ATPase activ. (8/14,0.001)
      helicase activity (9/18,0.0008)
      RNA methyltransfer. activ. (7/11,0.001)
      nucleic acid binding (23/109,0.0041)
      binding (25/129,0.0080)
      S-adenosylmethion.-depend. methy (6/13,0.048)
  CC: nucleolus (49/118,0.00)
      nucleus (68/288,0.00)
      small nucleolar ribonucleoprot. co (11/27,0.001)

c9    Rep: 48   Ind: 8   Coh.: BP 62, MF 58, CC 33
  BP: RNA metabolism (17/119,0.0025)
      RNA processing (16/110,0.0039)
      nucleobase, nucleoside, nucleotide a (21/188,0.0083)
      transcription, DNA-dependent (15/117,0.0369)
      transcription (15/119,0.0451)
  MF: binding (19/129,0.0006)
      nucleic acid binding (15/109,0.0251)
  CC: nucleolus (18/118,0.0336)

c18   Rep: 122   Ind: 0   Coh.: BP 100, MF 98, CC 99
  BP: protein biosynthesis (112/189,0.00)
      macromolecule biosynthesis (112/189,0.00)
      biosynthesis (112/236,0.00)
      protein metabolism (112/252,0.00)
      metabolism (112/523,0.00)
      ribosomal small subunit assembly and (8/10,0.0013)
      regulation of translational fidelity (6/7,0.0082)
      regulation of translation (8/12,0.0107)
      ribosomal subunit assembly (15/36,0.0263)
  MF: struct. constit. of riboso. (110/127,0.00)
      struct. molec. activ. (110/128,0.00)
  CC: cytos. ribos. (sensu Euka.) (110/132,0.00)
      ribosome (110/153,0.00)
      cytosol (110/165,0.00)
      ribonucleoprotein complex (110/186,0.00)
      cytosol. large ribos. subunit (62/75,0.00)
      large ribosomal subunit (62/75,0.000)
      cytosol. small ribos. subunit (48/56,0.000)
      small ribosomal subunit (48/56,0.00)
      eukaryotic 48S initiation complex (48/56,0.000)
      eukaryotic 43S preinitiation complex (49/61,0.00)
      cytoplasm (113/525,0.00)
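
The (x/K,p) entries above can be related to a standard enrichment test. The following is a minimal sketch, assuming the P-value is an upper-tail hypergeometric probability multiplied by the number of tested annotations; this is a common choice, not a statement of the exact procedure used for the table. The function names and the cluster size n used in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(x: int, K: int, n: int, N: int) -> float:
    """P(X >= x) when n genes are drawn from N, K of which carry the annotation."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return hits / total


def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni correction: multiply by the number of tests and cap at 1."""
    return min(1.0, p * n_tests)


# Hypothetical illustration: an annotation carried by K = 4 genes in a population
# of N = 614 annotated genes, all x = 4 of them falling in a cluster of n = 60
# genes, with 472 annotations tested (N and the number of tests as in Table V).
p_raw = hypergeom_tail(x=4, K=4, n=60, N=614)
p_corrected = bonferroni(p_raw, n_tests=472)
print(p_raw, p_corrected)
```

Because every tested annotation contributes one test, even a small raw P-value can exceed the 0.05 threshold after correction when hundreds of annotations are considered.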
TABLE XI Coherence results for the SP500 data with respect to the GICS companies' annotations with Nc = 20 clusters. ^a Clustering algorithm. In the ⟨K-means⟩ row we present the average results of all six K-means variants. For each of these variants we performed 100 runs, from which the best solution was chosen. In the ⟨Hierarchical⟩ row we present the average results of all 12 hierarchical clustering variants. ^b Correlation measure used by the algorithm. PC stands for the (centered) Pearson correlation, |PC| is the absolute value of this correlation, and Euclidean stands for the Euclidean distance. ^c Number of clusters with a positive coherence. ^d Average coherence of all 20 clusters.



Nc = 20
Algorithm^a            Similarity^b         N_pos^c   ⟨Coh⟩^d
Iclust                 mutual information   20        86
K-means                PC                   19        74
K-means                |PC|                 17        69
K-means                Euclidean            15        58
K-medians              PC                   20        85
K-medians              |PC|                 20        88
K-medians              Euclidean            20        81
⟨K-means⟩                                   18.5      ?
Hier - Comp. linkage   PC                   16        70
Hier - Comp. linkage   |PC|                 16        70
Hier - Comp. linkage   Euclidean            4         12
Hier - Avg. linkage    PC                   7         32
Hier - Avg. linkage    |PC|                 7         32
Hier - Avg. linkage    Euclidean            0         0
Hier - Centr. linkage  PC                   2         10
Hier - Centr. linkage  |PC|                 2         10
Hier - Centr. linkage  Euclidean            0         0
Hier - Sing. linkage   PC                   2         10
Hier - Sing. linkage   |PC|                 2         10
Hier - Sing. linkage   Euclidean            0         0
⟨Hierarchical⟩                              4.8       21.3
TABLE XII Coherence results for the SP500 data with respect to the GICS companies' annotations with Nc = 15 clusters. The column and row definitions are as in Table XI.



Nc = 15
Algorithm              Similarity           N_pos     ⟨Coh⟩
Iclust                 mutual information   15        93
K-means                PC                   12        69
K-means                |PC|                 13        68
K-means                Euclidean            11        54
K-medians              PC                   15        90
K-medians              |PC|                 15        88
K-medians              Euclidean            15        85
⟨K-means⟩                                   13.5      75.7
Hier - Comp. linkage   PC                   11        63
Hier - Comp. linkage   |PC|                 11        63
Hier - Comp. linkage   Euclidean            2         5
Hier - Avg. linkage    PC                   6         32
Hier - Avg. linkage    |PC|                 6         32
Hier - Avg. linkage    Euclidean            0         0
Hier - Centr. linkage  PC                   1         7
Hier - Centr. linkage  |PC|                 1         7
Hier - Centr. linkage  Euclidean            0         0
Hier - Sing. linkage   PC                   1         7
Hier - Sing. linkage   |PC|                 1         7
Hier - Sing. linkage   Euclidean            0         0
⟨Hierarchical⟩                              3.3       18.6
TABLE XIII Coherence results for the SP500 data with respect to the GICS companies' annotations with Nc = 10 clusters. The column and row definitions are as in Table XI.



Nc = 10
Algorithm              Similarity           N_pos     ⟨Coh⟩
Iclust                 mutual information   10        91
K-means                PC                   10        84
K-means                |PC|                 10        85
K-means                Euclidean            8         63
K-medians              PC                   10        90
K-medians              |PC|                 10        90
K-medians              Euclidean            10        77
⟨K-means⟩                                   9.7       81.5
Hier - Comp. linkage   PC                   8         64
Hier - Comp. linkage   |PC|                 8         64
Hier - Comp. linkage   Euclidean            4         22
Hier - Avg. linkage    PC                   2         20
Hier - Avg. linkage    |PC|                 2         20
Hier - Avg. linkage    Euclidean            0         0
Hier - Centr. linkage  PC                   1         10
Hier - Centr. linkage  |PC|                 1         10
Hier - Centr. linkage  Euclidean            0         0
Hier - Sing. linkage   PC                   0         0
Hier - Sing. linkage   |PC|                 0         0
Hier - Sing. linkage   Euclidean            0         0
⟨Hierarchical⟩                              2.2       17.5
TABLE XIV Coherence results for the SP500 data with respect to the GICS companies' annotations with Nc = 5 clusters. The column and row definitions are as in Table XI.



Nc = 5
Algorithm              Similarity           N_pos     ⟨Coh⟩
Iclust                 mutual information   5         88
K-means                PC                   5         90
K-means                |PC|                 5         87
K-means                Euclidean            4         54
K-medians              PC                   5         90
K-medians              |PC|                 5         92
K-medians              Euclidean            5         84
⟨K-means⟩                                   4.8       82.8
Hier - Comp. linkage   PC                   4         66
Hier - Comp. linkage   |PC|                 5         84
Hier - Comp. linkage   Euclidean            3         36
Hier - Avg. linkage    PC                   1         20
Hier - Avg. linkage    |PC|                 1         20
Hier - Avg. linkage    Euclidean            0         0
Hier - Centr. linkage  PC                   0         0
Hier - Centr. linkage  |PC|                 0         0
Hier - Centr. linkage  Euclidean            0         0
Hier - Sing. linkage   PC                   0         0
Hier - Sing. linkage   |PC|                 0         0
Hier - Sing. linkage   Euclidean            0         0
⟨Hierarchical⟩                              1.2       18.8
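
The coherence values tabulated above can be read as follows. This is a minimal sketch, assuming that the coherence of a cluster is the percentage of its members that carry at least one annotation found to be enriched in that cluster; the function name and the toy data are hypothetical.

```python
def cluster_coherence(members, annotations, enriched):
    """Percentage of cluster members carrying at least one enriched annotation."""
    if not members:
        return 0.0
    hits = sum(1 for m in members if annotations.get(m, set()) & enriched)
    return 100.0 * hits / len(members)


# Hypothetical cluster of four companies; two of them carry the one annotation
# that is assumed to have passed the enrichment test.
members = ["A", "B", "C", "D"]
annotations = {
    "A": {"45", "4530"},
    "B": {"45", "4520"},
    "C": {"20"},
    "D": set(),
}
enriched = {"45"}
coherence = cluster_coherence(members, annotations, enriched)
print(coherence)
```

Under this reading a cluster with no enriched annotation has coherence 0, and N_pos counts the clusters whose coherence is positive.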
TABLE XV Enriched GICS annotations in the Iclust solution with Nc = 20 clusters for the SP500 data. Clusters are ordered as in the corresponding figures. Only annotations with a P-value below 0.05 (Bonferroni corrected) are presented. ^a Cluster index. ^b Cluster size. ^c Cluster coherence (in percentage). ^d Enriched annotations. In parentheses, (x/K,p) stands for the number of companies in the cluster to which this annotation is assigned, the number of companies in the entire data to which this annotation is assigned, and the Bonferroni corrected P-value, respectively.





Cluster^a   C size^b   Coh.^c   Enriched annot.^d

c11   size 18   Coh. 100
  4530 Semiconductors & Semiconductor Equipment (16/19,0.000000)
  453010 Semiconductors & Semiconductor Equipment (16/19,0.000000)
  45301020 Semiconductors (12/15,0.000000)
  45 Information Technology (18/81,0.000000)
  45301010 Semiconductor Equipment (4/4,0.000013)

c9    size 20   Coh. 95
  45 Information Technology (19/81,0.000000)
  4520 Technology Hardware & Equipment (10/35,0.000002)
  451030 Software (6/15,0.000155)
  452030 Electronic Equipment & Instruments (5/10,0.000276)
  4510 Software & Services (7/27,0.000617)
  452010 Communications Equipment (5/14,0.001970)
  45201020 Communications Equipment (5/14,0.001970)
  45103010 Application Software (4/8,0.002368)
  45203020 Electronic Manufacturing Services (3/4,0.004177)

c12   size 21   Coh. 95
  4520 Technology Hardware & Equipment (13/35,0.000000)
  45 Information Technology (17/81,0.000000)
  45202010 Computer Hardware (5/7,0.000035)
  452010 Communications Equipment (6/14,0.000132)
  45201020 Communications Equipment (6/14,0.000132)
  452020 Computers & Peripherals (5/10,0.000383)
  501020 Wireless Telecommunication Services (2/2,0.040138)
  50102010 Wireless Telecommunication Services (2/2,0.040138)

c10   size 20   Coh. 65
  2510 Automobiles & Components (6/9,0.000005)
  251010 Auto Components (4/6,0.000863)
  201010 Aerospace & Defense (4/9,0.006685)
  20101010 Aerospace & Defense (4/9,0.006685)
  25101010 Auto Parts & Equipment (3/4,0.006730)
  2010 Capital Goods (7/37,0.008990)
  25102010 Automobile Manufacturers (2/2,0.046560)

c16   size 10   Coh. 30
  25301020 Hotels Resorts & Cruise Lines (3/4,0.000546)
  2530 Hotels Restaurants & Leisure (3/11,0.020860)
  253010 Hotels Restaurants & Leisure (3/11,0.020860)

c18   size 176   Coh. 19
  351010 Health Care Equipment & Supplies (13/13,0.000133)
  35101010 Health Care Equipment (11/11,0.001186)
  2020 Commercial Services & Supplies (11/12,0.009850)
  202010 Commercial Services & Supplies (11/12,0.009850)
  2030 Transportation (9/9,0.010371)

c3    size 17   Coh. 83
  351020 Health Care Providers & Services (9/16,0.000000)
  3510 Health Care Equipment & Services (9/29,0.000000)
  35 Health Care (10/47,0.000000)
  35102030 Managed Health Care (4/5,0.000015)
  35102015 Health Care Services (2/4,0.045569)
  35102020 Health Care Facilities (2/4,0.045569)
c14   size 18   Coh. 94
  352020 Pharmaceuticals (10/13,0.000000)
  35202010 Pharmaceuticals (10/13,0.000000)
  3520 Pharmaceuticals & Biotechnology (10/18,0.000000)
  35 Health Care (12/47,0.000000)
  501010 Diversified Telecommunication Services (5/9,0.000066)
  50101020 Integrated Telecommunication Services (5/9,0.000066)
  50 Telecommunication Services (5/11,0.000232)
  5010 Telecommunication Services (5/11,0.000232)

c5    size 18   Coh. 94
  30 Consumer Staples (17/35,0.000000)
  3020 Food Beverage & Tobacco (12/19,0.000000)
  302020 Food Products (9/10,0.000000)
  30202030 Packaged Foods & Meats (8/9,0.000000)
  3030 Household & Personal Products (4/6,0.000341)
  303010 Household Products (3/4,0.003000)
  30301010 Household Products (3/4,0.003000)
  302010 Beverages (3/6,0.014308)

c15   size 9   Coh. 83
  4040 Real Estate (5/6,0.000000)
  404010 Real Estate (5/6,0.000000)
  40401010 Real Estate Investment Trusts (5/6,0.000000)
  40 Financials (5/80,0.004491)

c2    size 15   Coh. 93
  2540 Media (10/14,0.000000)
  254010 Media (10/14,0.000000)
  25401040 Publishing (7/7,0.000000)
  25 Consumer Discretionary (14/83,0.000000)
  25401020 Broadcasting & Cable TV (2/3,0.031371)
  252010 Household Durables (3/11,0.040516)

c17   size 23   Coh. 100
  2550 Retailing (19/30,0.000000)
  25 Consumer Discretionary (21/83,0.000000)
  255030 Multiline Retail (9/11,0.000000)
  255040 Specialty Retail (10/17,0.000000)
  25503010 Department Stores (5/7,0.000050)
  25503020 General Merchandise Stores (4/4,0.000065)
  25504010 Apparel Retail (3/3,0.001574)
  25504040 Specialty Stores (4/8,0.004005)
  30101040 HyperMarkets & Super Centers (2/2,0.036344)

c4    size 19   Coh. 100
  55 Utilities (19/36,0.000000)
  5510 Utilities (19/36,0.000000)
  551010 Electric Utilities (14/22,0.000000)
  55101010 Electric Utilities (14/22,0.000000)
  551020 Gas Utilities (3/6,0.007516)
  55102010 Gas Utilities (3/6,0.007516)

c13   size 23   Coh. 100
  4030 Insurance (19/21,0.000000)
  403010 Insurance (19/21,0.000000)
  40 Financials (23/80,0.000000)
  40301040 Property & Casualty Insurance (9/9,0.000000)
  40301020 Life & Health Insurance (6/7,0.000000)
  40301030 Multi-line Insurance (3/3,0.001203)
  401020 Thrifts & Mortgage Finance (3/6,0.021900)
  40102010 Thrifts & Mortgage Finance (3/6,0.021900)

c7    size 15   Coh. 100
  4020 Diversified Financials (15/24,0.000000)
  402030 Capital Markets (13/16,0.000000)
  40 Financials (15/80,0.000000)
  40203020 Investment Banking & Brokerage (6/7,0.000000)
  40203010 Asset Management & Custody Banks (6/8,0.000000)

c1    size 21   Coh. 100
  401010 Commercial Banks (21/23,0.000000)
  4010 Banks (21/29,0.000000)
  40101015 Regional Banks (16/17,0.000000)
  40 Financials (21/80,0.000000)
  40101010 Diversified Banks (5/6,0.000003)
C     Size  Coh.  Enriched annot.

c8    23    83    201060 Machinery (12/14,0.000000)
                  2010 Capital Goods (16/37,0.000000)
                  20 Industrials (16/58,0.000000)
                  20106020 Industrial Machinery (7/9,0.000000)
                  20106010 Construction & Farm Machinery & Heavy Trucks (5/5,0.000004)
                  151050 Paper & Forest Products (3/5,0.023469)

c19   14    93    151010 Chemicals (11/14,0.000000)
                  15 Materials (13/33,0.000000)
                  1510 Materials (13/33,0.000000)
                  15101020 Diversified Chemicals (5/6,0.000001)
                  15101050 Specialty Chemicals (4/5,0.000026)
                  15101040 Industrial Gases (2/2,0.009228)
                  15103020 Paper Packaging (2/3,0.027226)

c6    13    100   10 Energy (13/23,0.000000)
                  1010 Energy (13/23,0.000000)
                  101010 Energy Equipment & Services (7/7,0.000000)
                  10102020 Oil & Gas Exploration & Production (6/7,0.000000)
                  10101020 Oil & Gas Equipment & Services (4/4,0.000002)
                  101020 Oil & Gas (6/16,0.000005)
                  10101010 Oil & Gas Drilling (3/3,0.000105)

c20   8     100   101020 Oil & Gas (8/16,0.000000)
                  10 Energy (8/23,0.000000)
                  1010 Energy (8/23,0.000000)
                  10102010 Integrated Oil & Gas (5/6,0.000000)
                  10102030 Oil & Gas Refining & Marketing & Transportation (2/3,0.004224)
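
The annotation codes in the table above appear to follow a nested numbering scheme of the GICS type, in which a 2-digit sector code is extended to a 4-digit industry group, a 6-digit industry, and an 8-digit sub-industry. The following is a minimal sketch, under that assumption, of how one entry of the form "code name (x/K,p)" could be parsed and placed in the hierarchy. The names LEVELS, parse_entry, and is_ancestor are illustrative and not part of the original analysis.

```python
import re

# ASSUMPTION: level names keyed by the number of digits in the code.
LEVELS = {2: "sector", 4: "industry group", 6: "industry", 8: "sub-industry"}

# "<code> <name> (x/K,p)" as printed in the enriched-annotation column.
ENTRY = re.compile(r"^(\d+)\s+(.+?)\s*\((\d+)/(\d+),\s*([0-9.]+)\)$")


def parse_entry(line):
    """Split one table entry into code, level, name, counts, and P-value."""
    m = ENTRY.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized entry: {line!r}")
    code, name, x, K, p = m.groups()
    return {
        "code": code,
        "level": LEVELS.get(len(code), "unknown"),
        "name": name,
        "in_cluster": int(x),   # x: items in the cluster with this annotation
        "in_data": int(K),      # K: items in the whole data set with it
        "p_value": float(p),    # Bonferroni-corrected P-value as printed
    }


def is_ancestor(parent_code, child_code):
    """A shorter code is an ancestor of any longer code that it prefixes."""
    return len(parent_code) < len(child_code) and child_code.startswith(parent_code)


entry = parse_entry("40101015 Regional Banks (16/17,0.000000)")
print(entry["level"], entry["in_cluster"], entry["in_data"])
print(is_ancestor("4010", entry["code"]))
```

Reading the codes this way makes clear why a single cluster often lists the same group of stocks several times: an enriched sub-industry usually implies enrichment of its parent industry, industry group, and sector as well.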
TABLE XVI Coherence results for the EachMovie data with respect to the movie genre annotations with Nc = 20 clusters. 
^a Clustering algorithm. In the ⟨K-means⟩ row we present the average results of all six K-means variants. For each of 
these variants we performed 100 runs, from which the best solution was chosen. In the ⟨Hier.⟩ row we present the average 
results of all 12 hierarchical clustering variants. ^b Correlation measure used by the algorithm. PC stands for the (centered) 
Pearson correlation, |PC| is the absolute value of this correlation, and Euclidean stands for the Euclidean distance. ^c Number of 
clusters with a positive coherence. ^d Average coherence of all 20 clusters. 



Nc = 20
Algorithm^a              Similarity^b          Positive-coh. clusters^c   ⟨Coh⟩^d

Iclust                   mutual information    15                         54
K-means                  PC                    1                          3
K-means                  |PC|                  2                          4
K-means                  Euclidean             5                          12
K-medians                PC                    2                          5
K-medians                |PC|                  4                          8
K-medians                Euclidean             2                          6
⟨K-means⟩                                      2.7
Hier - Comp. linkage     PC                    17                         55
Hier - Comp. linkage     |PC|                  16                         51
Hier - Comp. linkage     Euclidean             10                         34
Hier - Avg. linkage      PC                    12                         43
Hier - Avg. linkage      |PC|                  12                         43
Hier - Avg. linkage      Euclidean             5                          19
Hier - Centr. linkage    PC                    4                          16
Hier - Centr. linkage    |PC|                  4                          16
Hier - Centr. linkage    Euclidean             2                          8
Hier - Sing. linkage     PC
Hier - Sing. linkage     |PC|
Hier - Sing. linkage     Euclidean             1                          5
⟨Hierarchical⟩                                 6.9                        24.2
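
The averaged rows of the table above can be checked directly from the individual variants. The sketch below recomputes the ⟨Hierarchical⟩ row and the cluster count of the ⟨K-means⟩ row for Nc = 20. It assumes that the blank cells of the two single-linkage rows stand for zero, which the original table does not state explicitly; the helper name column_means is illustrative.

```python
from statistics import mean

# (clusters with positive coherence, average coherence in %), Nc = 20.
hierarchical = [
    (17, 55), (16, 51), (10, 34),   # complete linkage: PC, |PC|, Euclidean
    (12, 43), (12, 43), (5, 19),    # average linkage
    (4, 16), (4, 16), (2, 8),       # centroid linkage
    (0, 0), (0, 0), (1, 5),         # single linkage; blank cells ASSUMED to be 0
]

# Positive-coherence counts for K-means and K-medians (PC, |PC|, Euclidean each).
kmeans_counts = [1, 2, 5, 2, 4, 2]


def column_means(rows):
    """Average each column of a list of equal-length tuples, to one decimal."""
    return tuple(round(mean(col), 1) for col in zip(*rows))


print(column_means(hierarchical))
print(round(mean(kmeans_counts), 1))
```

Under the zero assumption the recomputed values agree with the printed averages (6.9 and 24.2 for the hierarchical variants, 2.7 for the K-means count), which suggests that the blank single-linkage cells are indeed zeros lost in typesetting.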
TABLE XVII Coherence results for the EachMovie data with respect to the movie genre annotations with Nc = 15 clusters. 
The column and row definitions are as in Table XVI. 



Nc = 15
Algorithm^a              Similarity^b          Positive-coh. clusters^c   ⟨Coh⟩^d

Iclust                   mutual information    11                         54
K-means                  PC                    1                          2
K-means                  |PC|                  1                          1
K-means                  Euclidean             2                          6
K-medians                PC                    2                          5
K-medians                |PC|                  1                          3
K-medians                Euclidean             4                          14
⟨K-means⟩                                      1.8                        5.2
Hier - Comp. linkage     PC                    13                         54
Hier - Comp. linkage     |PC|                  11                         47
Hier - Comp. linkage     Euclidean             6                          29
Hier - Avg. linkage      PC                    10                         47
Hier - Avg. linkage      |PC|                  10                         46
Hier - Avg. linkage      Euclidean             3                          16
Hier - Centr. linkage    PC                    2                          8
Hier - Centr. linkage    |PC|                  2                          8
Hier - Centr. linkage    Euclidean             1                          3
Hier - Sing. linkage     PC
Hier - Sing. linkage     |PC|
Hier - Sing. linkage     Euclidean             1                          7
⟨Hierarchical⟩                                 4.9                        22.1
TABLE XVIII Coherence results for the EachMovie data with respect to the movie genre annotations with Nc = 10 clusters. 
The column and row definitions are as in Table XVI. 



Nc = 10
Algorithm^a              Similarity^b          Positive-coh. clusters^c   ⟨Coh⟩^d

Iclust                   mutual information    9                          57
K-means                  PC                    1                          3
K-means                  |PC|                  1                          6
K-means                  Euclidean             4                          20
K-medians                PC                    2                          6
K-medians                |PC|                  2                          6
K-medians                Euclidean             6                          27
⟨K-means⟩                                      2.7                        11.3
Hier - Comp. linkage     PC                    8                          43
Hier - Comp. linkage     |PC|                  8                          44
Hier - Comp. linkage     Euclidean             5                          36
Hier - Avg. linkage      PC                    7                          43
Hier - Avg. linkage      |PC|                  8                          47
Hier - Avg. linkage      Euclidean             2                          16
Hier - Centr. linkage    PC                    2                          12
Hier - Centr. linkage    |PC|                  2                          12
Hier - Centr. linkage    Euclidean             1                          4
Hier - Sing. linkage     PC
Hier - Sing. linkage     |PC|
Hier - Sing. linkage     Euclidean             1                          10
⟨Hierarchical⟩                                 3.7                        22.3
TABLE XIX Coherence results for the EachMovie data with respect to the movie genre annotations with Nc = 5 clusters. The 
column and row definitions are as in Table XVI. 



Nc = 5
Algorithm^a              Similarity^b          Positive-coh. clusters^c   ⟨Coh⟩^d

Iclust                   mutual information    5                          48
K-means                  PC                    2                          21
K-means                  |PC|                  2                          19
K-means                  Euclidean             3                          30
K-medians                PC                    4                          32
K-medians                |PC|                  2                          19
K-medians                Euclidean             4                          37
⟨K-means⟩                                      2.8                        26.3
Hier - Comp. linkage     PC                    5                          51
Hier - Comp. linkage     |PC|                  5                          41
Hier - Comp. linkage     Euclidean             4                          48
Hier - Avg. linkage      PC                    4                          46
Hier - Avg. linkage      |PC|                  4                          47
Hier - Avg. linkage      Euclidean             2                          21
Hier - Centr. linkage    PC                    2                          23
Hier - Centr. linkage    |PC|                  2                          23
Hier - Centr. linkage    Euclidean             1                          9
Hier - Sing. linkage     PC
Hier - Sing. linkage     |PC|
Hier - Sing. linkage     Euclidean
⟨Hierarchical⟩                                 2.4                        25.8






TABLE XX Enriched genre annotations in the Iclust solution with Nc = 20 clusters for the EachMovie data. Clusters are 
ordered as in Figure 12, Figure 13, and Figure 14. Only annotations with a P-value below 0.05 (Bonferroni corrected) are 
presented. ^a Cluster index. ^b Cluster size. ^c Cluster coherence (in percentage). ^d Enriched annotations. In parentheses: (x/K,p) 
stands for the number of movies in the cluster to which this annotation is assigned, the number of movies in the entire data to 
which this annotation is assigned, and the Bonferroni corrected P-value, respectively. 





C^a   Size^b  Coh.^c  Enriched annot.^d

c14   10
c2    16
c19   10      50      Art-Foreign (5/45,0.005254)
c8    10      70      Art-Foreign (7/45,0.000019)
c18   22              -
c9    31      55      Action (17/110,0.000281)
c20   155     55      Drama (68/160,0.001170)
                      Romance (30/61,0.011591)
c1    19      95      Classic (10/44,0.000004)
                      Drama (15/160,0.000214)
c7    24      71      Classic (10/44,0.000067)
                      Action (13/110,0.003526)
c15   18      94      Action (16/110,0.000000)
                      Thriller (10/90,0.001778)
c5    32      39      Thriller (12/90,0.034492)
c13   15      40      Horror (6/33,0.001412)
c3    20
c10   15
c4    27      74      Romance (12/61,0.000100)
                      Comedy (17/149,0.001613)
c6    11      100     Comedy (11/149,0.000001)
c16   16      87      Comedy (13/149,0.000012)
c17   21      76      Action (16/110,0.000000)
c11   14      71      Family (10/67,0.000003)
c12   14      100     Family (13/67,0.000000)
                      Animation (8/25,0.000000)
                      Classic (5/44,0.019004)
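
To make the (x/K,p) entries above concrete, the following is a minimal sketch of an enrichment calculation. It assumes a one-sided hypergeometric test, which is a common choice for this kind of annotation enrichment but is not confirmed by this section, and a data set of N = 500 movies, which is the sum of the cluster sizes listed in the table. The number of tested annotations used for the Bonferroni correction (n_tests) is hypothetical, and the function names are illustrative. The coherence helper covers only the simple case of a cluster with a single enriched annotation.

```python
from math import comb


def hypergeom_tail(x, n, K, N):
    """P(X >= x) when n items are drawn without replacement from N items,
    of which K carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    # comb() returns 0 when the lower index exceeds the upper one,
    # so impossible draws contribute nothing to the sum.
    hits = sum(comb(K, k) * comb(N - K, n - k) for k in range(x, upper + 1))
    return hits / total


def bonferroni(p, n_tests):
    """Bonferroni correction: multiply by the number of tests, cap at 1."""
    return min(1.0, p * n_tests)


def coherence(n_annotated, cluster_size):
    """Percentage of cluster members carrying the enriched annotation."""
    return round(100 * n_annotated / cluster_size)


# Cluster c6: 11 movies, all annotated Comedy (149 comedies in the data).
N = 500        # ASSUMPTION: total number of movies (sum of cluster sizes)
n_tests = 10   # HYPOTHETICAL number of annotations tested

p_raw = hypergeom_tail(x=11, n=11, K=149, N=N)
print(p_raw, bonferroni(p_raw, n_tests))
print(coherence(5, 10), coherence(17, 31))  # clusters c19 and c9
```

For clusters with a single enriched annotation, such as c19 and c9, the ratio of x to the cluster size reproduces the coherence column (50 and 55). For clusters with several enriched annotations the overlap between annotations would have to be known, which the table does not provide.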



